···
# check performance via logfire

use `mcp__logfire__arbitrary_query` with `age` in minutes (max 43200 = 30 days).

note: `duration` is in seconds (DOUBLE PRECISION), multiply by 1000 for ms.

## latency percentiles by endpoint
```sql
SELECT span_name,
       COUNT(*) as count,
       ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration) * 1000, 2) as p50_ms,
       ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration) * 1000, 2) as p95_ms,
       ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration) * 1000, 2) as p99_ms
FROM records
WHERE span_name LIKE 'http.%'
GROUP BY span_name
ORDER BY count DESC
```

## slow requests with trace IDs
```sql
SELECT span_name, duration * 1000 as ms, trace_id, start_timestamp
FROM records
WHERE span_name LIKE 'http.%' AND duration > 0.1
ORDER BY duration DESC
LIMIT 20
```

## trace breakdown (drill into slow request)
```sql
SELECT span_name, duration * 1000 as ms, message, attributes->>'sql' as sql
FROM records
WHERE trace_id = '<TRACE_ID>'
ORDER BY start_timestamp
```

## database comparison (turso vs local)
```sql
SELECT
  CASE WHEN span_name = 'db.query' THEN 'turso'
       WHEN span_name = 'db.local.query' THEN 'local' END as db,
  COUNT(*) as queries,
  ROUND(AVG(duration) * 1000, 2) as avg_ms,
  ROUND(MAX(duration) * 1000, 2) as max_ms
FROM records
WHERE span_name IN ('db.query', 'db.local.query')
GROUP BY db
```

## recent errors
```sql
SELECT start_timestamp, span_name, exception_type, exception_message
FROM records
WHERE exception_type IS NOT NULL
ORDER BY start_timestamp DESC
LIMIT 10
```

## traffic pattern (requests per minute)
```sql
SELECT date_trunc('minute', start_timestamp) as minute,
       COUNT(*) as requests
FROM records
WHERE span_name LIKE 'http.%'
GROUP BY minute
ORDER BY minute DESC
LIMIT 30
```

## search query distribution
```sql
SELECT attributes->>'query' as query, COUNT(*) as count
FROM records
WHERE span_name = 'http.search' AND attributes->>'query' IS NOT NULL
GROUP BY query
ORDER BY count DESC
LIMIT 20
```

## typical workflow
1. run latency percentiles to get a baseline
2. if p95/p99 are high, find slow requests with trace IDs
3. drill into a specific trace to see which child spans are slow
4. check the db comparison to see if turso calls are the bottleneck
+23
.claude/commands/check-prod.md
···
# check prod health

## quick status
```bash
curl -s https://leaflet-search-backend.fly.dev/health
curl -s https://leaflet-search-backend.fly.dev/stats | jq
```

## observability
use the logfire MCP server to query traces and logs:
- `mcp__logfire__arbitrary_query` - run SQL against traces/spans
- `mcp__logfire__find_exceptions_in_file` - recent exceptions by file
- `mcp__logfire__schema_reference` - see available columns

## database
use the turso CLI for direct SQL:
```bash
turso db shell leaflet-search "SELECT COUNT(*) FROM documents"
turso db shell leaflet-search "SELECT * FROM documents ORDER BY created_at DESC LIMIT 5"
```

## tap status
from `tap/` directory: `just check`
···
# leaflet-search notes

## deployment
- **backend**: push to `main` touching `backend/**` → auto-deploys via GitHub Actions
- **frontend**: manual deploy only (`wrangler pages deploy site --project-name leaflet-search`)
- **tap**: manual deploy from `tap/` directory (`fly deploy --app leaflet-search-tap`)

## remotes
- `origin`: tangled.sh:zzstoatzz.io/leaflet-search
- `github`: github.com/zzstoatzz/leaflet-search (CI runs here)
- push to both: `git push origin main && git push github main`

## architecture
- **backend** (Zig): HTTP API, FTS5 search, vector similarity
- **tap**: firehose sync via bluesky-social/indigo tap
- **site**: static frontend on Cloudflare Pages
- **db**: Turso (source of truth) + local SQLite read replica (FTS queries)

## platforms
- leaflet, pckt, offprint, greengale: known platforms (detected via basePath)
- other: site.standard.* documents not from a known platform

## search ranking
- hybrid BM25 + recency: `ORDER BY rank + (days_old / 30)` (see the sketch below)
- OR between terms for recall, prefix on last word
- unicode61 tokenizer (non-alphanumeric = separator)
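
a minimal sketch of that ordering in FTS5 terms — the table and column names (`documents_fts`, `documents`, `created_at`) are illustrative, and the real queries live in `backend/src/search.zig`:

```sql
-- sketch only: BM25 relevance plus an age penalty of roughly 1 point per 30 days.
-- bm25() is more negative for better matches, so adding the penalty pushes old docs down.
SELECT d.uri, d.title
FROM documents_fts
JOIN documents d ON d.rowid = documents_fts.rowid
WHERE documents_fts MATCH 'crypto OR casino*'
ORDER BY bm25(documents_fts)
       + (julianday('now') - julianday(d.created_at)) / 30.0
LIMIT 20;
```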

## tap operations
- from `tap/` directory: `just check` (status), `just turbo` (catch-up), `just normal` (steady state)
- see `docs/tap.md` for memory tuning and debugging

## common tasks
- check indexing: `curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq`
+36-35
README.md
···
-# leaflet-search
+# pub search

 by [@zzstoatzz.io](https://bsky.app/profile/zzstoatzz.io)

-search for [leaflet](https://leaflet.pub).
+search ATProto publishing platforms ([leaflet](https://leaflet.pub), [pckt](https://pckt.blog), [offprint](https://offprint.app), [greengale](https://greengale.app), and others using [standard.site](https://standard.site)).

-**live:** [leaflet-search.pages.dev](https://leaflet-search.pages.dev)
+**live:** [pub-search.waow.tech](https://pub-search.waow.tech)
+
+> formerly "leaflet-search" - generalized to support multiple publishing platforms

 ## how it works

-1. **tap** syncs leaflet content from the network
+1. **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** syncs content from ATProto firehose (signals on `site.standard.document`, filters `pub.leaflet.*` + `site.standard.*`)
 2. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API
 3. **site** static frontend on Cloudflare Pages
···
 search is also exposed as an MCP server for AI agents like Claude Code:

 ```bash
-claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
+claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
 ```

 see [mcp/README.md](mcp/README.md) for local setup and usage details.
···
 ## api

 ```
-GET /search?q=<query>&tag=<tag>   # full-text search with query, tag, or both
-GET /similar?uri=<at-uri>         # find similar documents via vector embeddings
-GET /tags                         # list all tags with counts
-GET /popular                      # popular search queries
-GET /stats                        # document/publication counts
-GET /health                       # health check
+GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date>   # full-text search
+GET /similar?uri=<at-uri>         # find similar documents
+GET /tags                         # list all tags with counts
+GET /popular                      # popular search queries
+GET /stats                        # counts + request latency (p50/p95)
+GET /health                       # health check
 ```

-search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). tag filtering applies to documents only.
+search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, or other). tag and platform filtering apply to documents only.
+
+**ranking**: results use hybrid BM25 + recency scoring. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).

 `/similar` uses [Voyage AI](https://voyageai.com) embeddings with brute-force cosine similarity (~0.15s for 3500 docs).

-## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)
+## configuration

-- [Fly.io](https://fly.io) hosts backend + tap
-- [Turso](https://turso.tech) cloud SQLite with vector support
-- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
-- [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs leaflet content from ATProto firehose
-- [Zig](https://ziglang.org) HTTP server, search API, content indexing
-- [Cloudflare Pages](https://pages.cloudflare.com) static frontend
+the backend is fully configurable via environment variables:

-## embeddings
+| variable | default | description |
+|----------|---------|-------------|
+| `APP_NAME` | `leaflet-search` | name shown in startup logs |
+| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
+| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
+| `TAP_PORT` | `443` | tap websocket port |
+| `PORT` | `3000` | HTTP server port |
+| `TURSO_URL` | - | Turso database URL (required) |
+| `TURSO_TOKEN` | - | Turso auth token (required) |
+| `VOYAGE_API_KEY` | - | Voyage AI API key (for embeddings) |

-documents are embedded using Voyage AI's `voyage-3-lite` model (512 dimensions). new documents from the firehose don't automatically get embeddings - they need to be backfilled periodically.
+the backend indexes multiple ATProto platforms - currently `pub.leaflet.*` and `site.standard.*` collections. platform is stored per-document and returned in search results.

-### backfill embeddings
+## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)

-requires `TURSO_URL`, `TURSO_TOKEN`, and `VOYAGE_API_KEY` in `.env`:
+- [Fly.io](https://fly.io) hosts [Zig](https://ziglang.org) search API and content indexing
+- [Turso](https://turso.tech) cloud SQLite with [Voyage AI](https://voyageai.com) vector support
+- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
+- [Cloudflare Pages](https://pages.cloudflare.com) static frontend

-```bash
-# check how many docs need embeddings
-./scripts/backfill-embeddings --dry-run
+## embeddings

-# run the backfill (uses batching + concurrency)
-./scripts/backfill-embeddings --batch-size 50
-```
-
-the script:
-- fetches docs where `embedding IS NULL`
-- batches them to Voyage API (50 docs/batch default)
-- writes embeddings to Turso in batched transactions
-- runs 8 concurrent workers
+documents are embedded using Voyage AI's `voyage-3-lite` model (512 dimensions). the backend automatically generates embeddings for new documents via a background worker - no manual backfill needed.

 **note:** we use brute-force cosine similarity instead of a vector index. Turso's DiskANN index has ~60s write latency per row, making it impractical for incremental updates. brute-force on 3500 vectors runs in ~0.15s which is fine for this scale.
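
as a rough illustration of the brute-force approach, a query along these lines can be run with libSQL's vector functions — the column and parameter names here (`embedding`, `:source_uri`) are assumptions, not the backend's actual query:

```sql
-- sketch only: compare every stored embedding against the source document
SELECT uri, title,
       vector_distance_cos(
         embedding,
         (SELECT embedding FROM documents WHERE uri = :source_uri)
       ) AS distance
FROM documents
WHERE embedding IS NOT NULL AND uri != :source_uri
ORDER BY distance   -- cosine distance: smaller means more similar
LIMIT 10;
```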
···
 const std = @import("std");
+const posix = std.posix;
 const schema = @import("schema.zig");
 const result = @import("result.zig");
+const sync = @import("sync.zig");

 // re-exports
 pub const Client = @import("Client.zig");
+pub const LocalDb = @import("LocalDb.zig");
 pub const Row = result.Row;
 pub const Result = result.Result;
 pub const BatchResult = result.BatchResult;
···
 // global state
 var gpa: std.heap.GeneralPurposeAllocator(.{}) = .{};
 var client: ?Client = null;
+var local_db: ?LocalDb = null;

-pub fn init() !void {
+/// Initialize Turso client only (fast, call synchronously at startup)
+pub fn initTurso() !void {
     client = try Client.init(gpa.allocator());
     try schema.init(&client.?);
 }

+/// Initialize local SQLite replica (slow, call in background thread)
+pub fn initLocalDb() void {
+    initLocal() catch |err| {
+        std.debug.print("local db init failed (will use turso only): {}\n", .{err});
+    };
+}
+
+pub fn init() !void {
+    try initTurso();
+    initLocalDb();
+}
+
+fn initLocal() !void {
+    // check if local db is disabled
+    if (posix.getenv("LOCAL_DB_ENABLED")) |val| {
+        if (std.mem.eql(u8, val, "false") or std.mem.eql(u8, val, "0")) {
+            std.debug.print("local db disabled via LOCAL_DB_ENABLED\n", .{});
+            return;
+        }
+    }
+
+    local_db = LocalDb.init(gpa.allocator());
+    try local_db.?.open();
+}
+
 pub fn getClient() ?*Client {
     if (client) |*c| return c;
     return null;
 }
+
+/// Get local db if ready (synced and available)
+pub fn getLocalDb() ?*LocalDb {
+    if (local_db) |*l| {
+        if (l.isReady()) return l;
+    }
+    return null;
+}
+
+/// Get local db even if not ready (for sync operations)
+pub fn getLocalDbRaw() ?*LocalDb {
+    if (local_db) |*l| return l;
+    return null;
+}
+
+/// Start background sync thread (call from main after db.init)
+pub fn startSync() void {
+    const c = getClient() orelse {
+        std.debug.print("sync: no turso client, skipping\n", .{});
+        return;
+    };
+    const local = getLocalDbRaw() orelse {
+        std.debug.print("sync: no local db, skipping\n", .{});
+        return;
+    };
+
+    const thread = std.Thread.spawn(.{}, syncLoop, .{ c, local }) catch |err| {
+        std.debug.print("sync: failed to start thread: {}\n", .{err});
+        return;
+    };
+    thread.detach();
+    std.debug.print("sync: background thread started\n", .{});
+}
+
+fn syncLoop(turso: *Client, local: *LocalDb) void {
+    // full sync on startup
+    sync.fullSync(turso, local) catch |err| {
+        std.debug.print("sync: initial full sync failed: {}\n", .{err});
+    };
+
+    // get sync interval from env (default 5 minutes)
+    const interval_secs: u64 = blk: {
+        const env_val = posix.getenv("SYNC_INTERVAL_SECS") orelse "300";
+        break :blk std.fmt.parseInt(u64, env_val, 10) catch 300;
+    };
+
+    std.debug.print("sync: incremental sync every {d} seconds\n", .{interval_secs});
+
+    // periodic incremental sync
+    while (true) {
+        std.Thread.sleep(interval_secs * std.time.ns_per_s);
+        sync.incrementalSync(turso, local) catch |err| {
+            std.debug.print("sync: incremental sync failed: {}\n", .{err});
+        };
+    }
+}
+105-1
backend/src/db/schema.zig
···
     \\CREATE VIRTUAL TABLE IF NOT EXISTS publications_fts USING fts5(
     \\  uri UNINDEXED,
     \\  name,
-    \\  description
+    \\  description,
+    \\  base_path
     \\)
     , &.{});
···
     client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
     client.exec("UPDATE documents SET source_collection = 'pub.leaflet.document' WHERE source_collection IS NULL", &.{}) catch {};

+    // multi-platform support for publications
+    client.exec("ALTER TABLE publications ADD COLUMN platform TEXT DEFAULT 'leaflet'", &.{}) catch {};
+    client.exec("ALTER TABLE publications ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.publication'", &.{}) catch {};
+    client.exec("UPDATE publications SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
+    client.exec("UPDATE publications SET source_collection = 'pub.leaflet.publication' WHERE source_collection IS NULL", &.{}) catch {};
+
     // vector embeddings column already added by backfill script
+
+    // dedupe index: same (did, rkey) across collections = same document
+    // e.g., pub.leaflet.document/abc and site.standard.document/abc are the same content
+    client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_documents_did_rkey ON documents(did, rkey)", &.{}) catch {};
+    client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_publications_did_rkey ON publications(did, rkey)", &.{}) catch {};
+
+    // backfill platform from source_collection for records indexed before platform detection fix
+    client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform = 'unknown' AND source_collection LIKE 'pub.leaflet.%'", &.{}) catch {};
+    client.exec("UPDATE documents SET platform = 'pckt' WHERE platform = 'unknown' AND source_collection LIKE 'blog.pckt.%'", &.{}) catch {};
+
+    // rename 'standardsite' to 'other' (standardsite was a misnomer - it's a lexicon, not a platform)
+    // documents using site.standard.* that don't match a known platform are simply "other"
+    client.exec("UPDATE documents SET platform = 'other' WHERE platform = 'standardsite'", &.{}) catch {};
+
+    // detect platform from publication basePath (site.standard.* is a lexicon, not a platform)
+    // known platforms (pckt, leaflet, offprint) use site.standard.* but have distinct basePaths
+    client.exec(
+        \\UPDATE documents SET platform = 'pckt'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%pckt.blog%')
+    , &.{}) catch {};
+
+    client.exec(
+        \\UPDATE documents SET platform = 'leaflet'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%leaflet.pub%')
+    , &.{}) catch {};
+
+    client.exec(
+        \\UPDATE documents SET platform = 'offprint'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%offprint.app%' OR base_path LIKE '%offprint.test%')
+    , &.{}) catch {};
+
+    client.exec(
+        \\UPDATE documents SET platform = 'greengale'
+        \\WHERE platform IN ('other', 'unknown')
+        \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%greengale.app%')
+    , &.{}) catch {};
+
+    // URL path field for documents (e.g., "/001" for zat.dev)
+    // used to build full URL: publication.url + document.path
+    client.exec("ALTER TABLE documents ADD COLUMN path TEXT", &.{}) catch {};
+
+    // denormalized columns for query performance (avoids per-row subqueries)
+    client.exec("ALTER TABLE documents ADD COLUMN base_path TEXT DEFAULT ''", &.{}) catch {};
+    client.exec("ALTER TABLE documents ADD COLUMN has_publication INTEGER DEFAULT 0", &.{}) catch {};
+
+    // backfill base_path from publications (idempotent - only updates empty values)
+    client.exec(
+        \\UPDATE documents SET base_path = COALESCE(
+        \\  (SELECT p.base_path FROM publications p WHERE p.uri = documents.publication_uri),
+        \\  (SELECT p.base_path FROM publications p WHERE p.did = documents.did LIMIT 1),
+        \\  ''
+        \\) WHERE base_path IS NULL OR base_path = ''
+    , &.{}) catch {};
+
+    // backfill has_publication (idempotent)
+    client.exec(
+        "UPDATE documents SET has_publication = CASE WHEN publication_uri != '' THEN 1 ELSE 0 END WHERE has_publication = 0 AND publication_uri != ''",
+        &.{},
+    ) catch {};
+
+    // note: publications_fts was rebuilt with base_path column via scripts/rebuild-pub-fts
+    // new publications will include base_path via insertPublication in indexer.zig
+
+    // 2026-01-22: clean up stale publication/self records that were deleted from ATProto
+    // these cause incorrect basePath lookups for greengale documents
+    // specifically: did:plc:27ivzcszryxp6mehutodmcxo had publication/self with basePath 'greengale.app'
+    // but that publication was deleted, and the correct one is 'greengale.app/3fz.org'
+    client.exec(
+        \\DELETE FROM publications WHERE rkey = 'self'
+        \\AND base_path = 'greengale.app'
+        \\AND did = 'did:plc:27ivzcszryxp6mehutodmcxo'
+    , &.{}) catch {};
+    client.exec(
+        \\DELETE FROM publications_fts WHERE uri IN (
+        \\  SELECT 'at://' || did || '/site.standard.publication/self'
+        \\  FROM publications WHERE rkey = 'self' AND base_path = 'greengale.app'
+        \\)
+    , &.{}) catch {};
+
+    // re-derive basePath for greengale documents that got wrong basePath
+    // match documents to greengale publications (basePath contains greengale.app)
+    // prefer more specific basePaths (with subdomain)
+    client.exec(
+        \\UPDATE documents SET base_path = (
+        \\  SELECT p.base_path FROM publications p
+        \\  WHERE p.did = documents.did
+        \\  AND p.base_path LIKE 'greengale.app/%'
+        \\  ORDER BY LENGTH(p.base_path) DESC
+        \\  LIMIT 1
+        \\)
+        \\WHERE platform = 'greengale'
+        \\AND (base_path = 'greengale.app' OR base_path LIKE '%pckt.blog%')
+        \\AND did IN (SELECT did FROM publications WHERE base_path LIKE 'greengale.app/%')
+    , &.{}) catch {};
 }
···
# API reference

base URL: `https://leaflet-search-backend.fly.dev`

## endpoints

### search

```
GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date>
```

full-text search across documents and publications.

**parameters:**

| param | type | required | description |
|-------|------|----------|-------------|
| `q` | string | no* | search query (titles and content) |
| `tag` | string | no | filter by tag (documents only) |
| `platform` | string | no | filter by platform: `leaflet`, `pckt`, `offprint`, `greengale`, `other` |
| `since` | string | no | ISO date, filter to documents created after |

*at least one of `q` or `tag` is required

**response:**
```json
[
  {
    "type": "article|looseleaf|publication",
    "uri": "at://did:plc:.../collection/rkey",
    "did": "did:plc:...",
    "title": "document title",
    "snippet": "...matched text...",
    "createdAt": "2025-01-15T...",
    "rkey": "abc123",
    "basePath": "gyst.leaflet.pub",
    "platform": "leaflet",
    "path": "/001"
  }
]
```

**result types:**
- `article`: document in a publication
- `looseleaf`: standalone document (no publication)
- `publication`: the publication itself (only returned for text queries, not tag/platform filters)

**ranking:** hybrid BM25 + recency. text relevance is primary, recent docs are boosted (~1 point per 30 days).

### similar

```
GET /similar?uri=<at-uri>
```

find semantically similar documents using vector similarity (voyage-3-lite embeddings).

**parameters:**

| param | type | required | description |
|-------|------|----------|-------------|
| `uri` | string | yes | AT-URI of source document |

**response:** same format as search (array of results)

### tags

```
GET /tags
```

list all tags with document counts, sorted by popularity.

**response:**
```json
[
  {"tag": "programming", "count": 42},
  {"tag": "rust", "count": 15}
]
```

### popular

```
GET /popular
```

popular search queries.

**response:**
```json
[
  {"query": "rust async", "count": 12},
  {"query": "leaflet", "count": 8}
]
```

### platforms

```
GET /platforms
```

document counts by platform.

**response:**
```json
[
  {"platform": "leaflet", "count": 2500},
  {"platform": "pckt", "count": 800},
  {"platform": "greengale", "count": 150},
  {"platform": "offprint", "count": 50},
  {"platform": "other", "count": 100}
]
```

### stats

```
GET /stats
```

index statistics and request timing.

**response:**
```json
{
  "documents": 3500,
  "publications": 120,
  "embeddings": 3200,
  "searches": 5000,
  "errors": 5,
  "cache_hits": 1200,
  "cache_misses": 800,
  "timing": {
    "search": {"count": 1000, "avg_ms": 25, "p50_ms": 20, "p95_ms": 50, "p99_ms": 80, "max_ms": 150},
    "similar": {"count": 200, "avg_ms": 150, "p50_ms": 140, "p95_ms": 200, "p99_ms": 250, "max_ms": 300},
    "tags": {"count": 500, "avg_ms": 5, "p50_ms": 4, "p95_ms": 10, "p99_ms": 15, "max_ms": 25},
    "popular": {"count": 300, "avg_ms": 3, "p50_ms": 2, "p95_ms": 5, "p99_ms": 8, "max_ms": 12}
  }
}
```

### activity

```
GET /activity
```

hourly activity counts (last 24 hours).

**response:**
```json
[12, 8, 5, 3, 2, 1, 0, 0, 1, 5, 15, 25, 30, 28, 22, 18, 20, 25, 30, 35, 28, 20, 15, 10]
```

### dashboard

```
GET /api/dashboard
```

rich dashboard data for analytics UI.

**response:**
```json
{
  "startedAt": 1705000000,
  "searches": 5000,
  "publications": 120,
  "documents": 3500,
  "platforms": [{"platform": "leaflet", "count": 2500}],
  "tags": [{"tag": "programming", "count": 42}],
  "timeline": [{"date": "2025-01-15", "count": 25}],
  "topPubs": [{"name": "gyst", "basePath": "gyst.leaflet.pub", "count": 150}],
  "timing": {...}
}
```

### health

```
GET /health
```

**response:**
```json
{"status": "ok"}
```

## building URLs

documents can be accessed on the web via their `basePath` and `rkey` (see the sketch below):
- articles: `https://{basePath}/{rkey}`, or `https://{basePath}{path}` if path is set
- publications: `https://{basePath}`

examples:
- `https://gyst.leaflet.pub/3ldasifz7bs2l`
- `https://greengale.app/3fz.org/001`
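
for illustration, the same rule expressed as a query over the denormalized columns (`base_path`, `path`, `rkey`); this is a sketch, not the code the backend actually runs:

```sql
-- sketch only: article URL is basePath + path when set, otherwise basePath + /rkey
SELECT uri,
       'https://' || base_path ||
       CASE WHEN path IS NOT NULL AND path != '' THEN path
            ELSE '/' || rkey
       END AS url
FROM documents
WHERE base_path != '';
```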
+99
docs/content-extraction.md
···
# content extraction for site.standard.document

lessons learned from implementing cross-platform content extraction.

## the problem

[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):

> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.

short answer: yes. but once you handle `content.pages` extraction, it's straightforward.

## textContent: platform-dependent

`site.standard.document` has a `textContent` field for pre-flattened plaintext:

```json
{
  "title": "my post",
  "textContent": "the full text content, ready for indexing...",
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  }
}
```

**pckt, offprint, greengale** populate `textContent`. extraction is trivial.

**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.

## extraction strategy

priority order (in `extractor.zig`):

1. `textContent` - use if present
2. `pages` - top-level blocks (pub.leaflet.document)
3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)

```zig
// try textContent first
if (zat.json.getString(record, "textContent")) |text| {
    return text;
}

// fall back to block parsing
const pages = zat.json.getArray(record, "pages") orelse
    zat.json.getArray(record, "content.pages");
```

the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.

## deduplication

documents can appear in both collections with identical `(did, rkey)`:
- `site.standard.document`
- `pub.leaflet.document`

handle with `ON CONFLICT`:

```sql
INSERT INTO documents (uri, ...)
ON CONFLICT(uri) DO UPDATE SET ...
```

note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.

## platform detection

collection name doesn't indicate platform for `site.standard.*` records. detection order (see the sketch after this list):

1. **basePath** - infer from publication basePath:

| basePath contains | platform |
|-------------------|----------|
| `leaflet.pub` | leaflet |
| `pckt.blog` | pckt |
| `offprint.app` | offprint |
| `greengale.app` | greengale |

2. **content.$type** - fallback for custom domains (e.g., `cailean.journal.ewancroft.uk`):

| content.$type starts with | platform |
|---------------------------|----------|
| `pub.leaflet.` | leaflet |

3. if neither matches → `other`
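
a compact way to picture that precedence as a query — the column names (`base_path`, `content_type`) are borrowed from the schema and extractor notes, and this is a sketch rather than the actual `indexer.zig` logic:

```sql
-- sketch only: basePath wins, content.$type is the fallback, else 'other'
SELECT uri,
       CASE
         WHEN base_path LIKE '%leaflet.pub%'    THEN 'leaflet'
         WHEN base_path LIKE '%pckt.blog%'      THEN 'pckt'
         WHEN base_path LIKE '%offprint.app%'   THEN 'offprint'
         WHEN base_path LIKE '%greengale.app%'  THEN 'greengale'
         WHEN content_type LIKE 'pub.leaflet.%' THEN 'leaflet'
         ELSE 'other'
       END AS platform
FROM documents;
```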

## summary

- **pckt/offprint/greengale**: use `textContent` directly
- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
- **deduplication**: `ON CONFLICT` on `(did, rkey)` or `uri`
- **platform**: infer from basePath, fall back to content.$type for custom domains

## code references

- `backend/src/extractor.zig` - content extraction logic, content_type field
- `backend/src/indexer.zig:99-118` - platform detection from basePath + content_type
+226
docs/scratch/leaflet-publishing-plan.md
···
# publishing to leaflet.pub

## goal

publish markdown docs to both:
1. `site.standard.document` (for search/interop) - already working
2. `pub.leaflet.document` (for leaflet.pub display) - this plan

## the mapping

### block types

| markdown | leaflet block |
|----------|---------------|
| `# heading` | `pub.leaflet.blocks.header` (level 1-6) |
| paragraph | `pub.leaflet.blocks.text` |
| ``` code ``` | `pub.leaflet.blocks.code` |
| `> quote` | `pub.leaflet.blocks.blockquote` |
| `---` | `pub.leaflet.blocks.horizontalRule` |
| `- item` | `pub.leaflet.blocks.unorderedList` |
| `![alt](url)` | `pub.leaflet.blocks.image` (requires blob upload) |
| `[text](url)` (standalone) | `pub.leaflet.blocks.website` |

### inline formatting (facets)

leaflet uses byte-indexed facets for inline formatting within text blocks:

```json
{
  "$type": "pub.leaflet.blocks.text",
  "plaintext": "hello world with bold text",
  "facets": [{
    "index": { "byteStart": 17, "byteEnd": 21 },
    "features": [{ "$type": "pub.leaflet.richtext.facet#bold" }]
  }]
}
```

| markdown | facet type |
|----------|------------|
| `**bold**` | `pub.leaflet.richtext.facet#bold` |
| `*italic*` | `pub.leaflet.richtext.facet#italic` |
| `` `code` `` | `pub.leaflet.richtext.facet#code` |
| `[text](url)` | `pub.leaflet.richtext.facet#link` |
| `~~strike~~` | `pub.leaflet.richtext.facet#strikethrough` |

## record structure

```json
{
  "$type": "pub.leaflet.document",
  "author": "did:plc:...",
  "title": "document title",
  "description": "optional description",
  "publishedAt": "2026-01-06T00:00:00Z",
  "publication": "at://did:plc:.../pub.leaflet.publication/rkey",
  "tags": ["tag1", "tag2"],
  "pages": [{
    "$type": "pub.leaflet.pages.linearDocument",
    "id": "page-uuid",
    "blocks": [
      {
        "$type": "pub.leaflet.pages.linearDocument#block",
        "block": { /* one of the block types above */ }
      }
    ]
  }]
}
```

## implementation plan

### phase 1: markdown parser

add a simple markdown block parser to zat or the publish script:

```zig
const BlockType = enum {
    heading,
    paragraph,
    code,
    blockquote,
    horizontal_rule,
    unordered_list,
    image,
};

const Block = struct {
    type: BlockType,
    content: []const u8,
    level: ?u8 = null, // for headings
    language: ?[]const u8 = null, // for code blocks
    alt: ?[]const u8 = null, // for images
    src: ?[]const u8 = null, // for images
};

fn parseMarkdownBlocks(allocator: Allocator, markdown: []const u8) ![]Block
```

parsing approach:
- split on blank lines to get blocks
- identify block type by first characters:
  - `#` → heading (count `#` for level)
  - ``` → code block (capture until closing ```)
  - `>` → blockquote
  - `---` → horizontal rule
  - `-` or `*` at start → list item
  - `![` → image
  - else → paragraph

### phase 2: inline facet extraction

for text blocks, extract inline formatting:

```zig
const Facet = struct {
    byte_start: usize,
    byte_end: usize,
    feature: FacetFeature,
};

const FacetFeature = union(enum) {
    bold,
    italic,
    code,
    link: []const u8, // url
    strikethrough,
};

fn extractFacets(allocator: Allocator, text: []const u8) !struct {
    plaintext: []const u8,
    facets: []Facet,
}
```

approach:
- scan for `**`, `*`, `` ` ``, `[`, `~~`
- track byte positions as we strip markers
- build facet list with adjusted indices

### phase 3: image blob upload

images need to be uploaded as blobs before referencing:

```zig
fn uploadImageBlob(client: *XrpcClient, allocator: Allocator, image_path: []const u8) !BlobRef
```

for now, could skip images or require them to already be uploaded.

### phase 4: json serialization

build the full `pub.leaflet.document` record:

```zig
const LeafletDocument = struct {
    @"$type": []const u8 = "pub.leaflet.document",
    author: []const u8,
    title: []const u8,
    description: ?[]const u8 = null,
    publishedAt: []const u8,
    publication: ?[]const u8 = null,
    tags: ?[][]const u8 = null,
    pages: []Page,
};

const Page = struct {
    @"$type": []const u8 = "pub.leaflet.pages.linearDocument",
    id: []const u8,
    blocks: []BlockWrapper,
};
```

### phase 5: integrate into publish-docs.zig

update the publish script to:
1. parse markdown into blocks
2. convert to leaflet structure
3. publish `pub.leaflet.document` alongside `site.standard.document`

```zig
// existing: publish site.standard.document
try putRecord(&client, allocator, session.did, "site.standard.document", tid.str(), doc_record);

// new: also publish pub.leaflet.document
const leaflet_record = try markdownToLeaflet(allocator, content, title, session.did, pub_uri);
try putRecord(&client, allocator, session.did, "pub.leaflet.document", tid.str(), leaflet_record);
```

## complexity estimate

| component | complexity | notes |
|-----------|------------|-------|
| block parsing | medium | regex-free, line-by-line |
| facet extraction | medium | byte index tracking is fiddly |
| image upload | low | already have blob upload in xrpc |
| json serialization | low | std.json handles it |
| integration | low | add to existing publish flow |

total: ~300-500 lines of zig

## open questions

1. **publication record**: do we need a `pub.leaflet.publication` too, or just documents?
   - leaflet allows standalone documents without publications
   - could skip publication for now

2. **image handling**:
   - option A: skip images initially (just text content)
   - option B: require images to be URLs (no blob upload)
   - option C: full blob upload support

3. **deduplication**: same rkey for both record types?
   - pro: easy to correlate
   - con: different collections, might not matter

4. **validation**: leaflet has a validate endpoint
   - could call `/api/unstable_validate` to check records before publish
   - probably skip for v1

## references

- [pub.leaflet.document schema](/tmp/leaflet/lexicons/pub/leaflet/document.json)
- [leaflet publishToPublication.ts](/tmp/leaflet/actions/publishToPublication.ts) - how leaflet creates records
- [site.standard.document schema](/tmp/standard.site/app/data/lexicons/document.json)
- paul's site: fetches records, doesn't publish them
+272
docs/scratch/logfire-zig-adoption.md
···
# logfire-zig adoption guide for leaflet-search

guide for integrating logfire-zig into the leaflet-search backend.

## 1. add dependency

in `backend/build.zig.zon`:

```zig
.dependencies = .{
    // ... existing deps ...
    .logfire = .{
        .url = "https://tangled.sh/zzstoatzz.io/logfire-zig/archive/main",
        .hash = "...", // run zig build to get hash
    },
},
```

in `backend/build.zig`, add the import:

```zig
const logfire = b.dependency("logfire", .{
    .target = target,
    .optimize = optimize,
});
exe.root_module.addImport("logfire", logfire.module("logfire"));
```

## 2. configure in main.zig

```zig
const std = @import("std");
const logfire = @import("logfire");
// ... other imports ...

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // configure logfire early
    // reads LOGFIRE_WRITE_TOKEN from env automatically
    const lf = try logfire.configure(.{
        .service_name = "leaflet-search",
        .service_version = "0.0.1",
        .environment = std.posix.getenv("FLY_APP_NAME") orelse "development",
    });
    defer lf.shutdown();

    logfire.info("starting leaflet-search on port {d}", .{port});

    // ... rest of main ...
}
```

## 3. replace timing.zig with spans

current pattern in server.zig:

```zig
fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
    const start_time = std.time.microTimestamp();
    defer timing.record(.search, start_time);
    // ...
}
```

with logfire:

```zig
fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
    const span = logfire.span("search.handle", .{});
    defer span.end();

    // parse params
    const query = parseQueryParam(alloc, target, "q") catch "";

    // add attributes after parsing
    span.setAttribute("query", query);
    span.setAttribute("tag", tag_filter orelse "");

    // ...
}
```

for nested operations:

```zig
fn search(alloc: Allocator, query: []const u8, ...) ![]Result {
    const span = logfire.span("search.execute", .{
        .query_length = @intCast(query.len),
    });
    defer span.end();

    // FTS query
    {
        const fts_span = logfire.span("search.fts", .{});
        defer fts_span.end();
        // ... FTS logic ...
    }

    // vector search fallback
    if (results.len < limit) {
        const vec_span = logfire.span("search.vector", .{});
        defer vec_span.end();
        // ... vector search ...
    }

    return results;
}
```

## 4. add structured logging

replace `std.debug.print` with logfire:

```zig
// before
std.debug.print("accept error: {}\n", .{err});

// after
logfire.err("accept error: {}", .{err});
```

```zig
// before
std.debug.print("{s} listening on http://0.0.0.0:{d}\n", .{app_name, port});

// after
logfire.info("{s} listening on port {d}", .{app_name, port});
```

for sync operations in tap.zig:

```zig
logfire.info("sync complete", .{});
logfire.debug("processed {d} events", .{event_count});
```

for errors (note the `{s}` specifier, since `@errorName` returns a string slice):

```zig
logfire.err("turso query failed: {s}", .{@errorName(err)});
```

## 5. add metrics

replace stats.zig counters with logfire metrics:

```zig
// before (in stats.zig)
pub fn recordSearch(query: []const u8) void {
    total_searches.fetchAdd(1, .monotonic);
    // ...
}

// with logfire (in server.zig or stats.zig)
pub fn recordSearch(query: []const u8) void {
    logfire.counter("search.total", 1);
    // existing logic...
}
```

for gauges (e.g., active connections, document counts):

```zig
logfire.gaugeInt("documents.indexed", doc_count);
logfire.gaugeInt("connections.active", active_count);
```

for latency histograms (more detail than a counter):

```zig
// after search completes
logfire.metric(.{
    .name = "search.latency_ms",
    .unit = "ms",
    .data = .{
        .histogram = .{
            .data_points = &[_]logfire.HistogramDataPoint{.{
                .start_time_ns = start_ns,
                .time_ns = std.time.nanoTimestamp(),
                .count = 1,
                .sum = latency_ms,
                .bucket_counts = ...,
                .explicit_bounds = ...,
                .min = latency_ms,
                .max = latency_ms,
            }},
        },
    },
});
```

## 6. deployment

add to fly.toml secrets:

```bash
fly secrets set LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxxxx --app leaflet-search-backend
```

logfire-zig reads from `LOGFIRE_WRITE_TOKEN` or `LOGFIRE_TOKEN` automatically.

## 7. what to keep from existing code

**keep timing.zig** - it provides local latency histograms for the dashboard API. logfire spans complement this with distributed tracing.

**keep stats.zig** - local counters are still useful for the `/stats` endpoint. logfire metrics add remote observability.

**keep activity.zig** - tracks recent activity for the dashboard. orthogonal to logfire.

the pattern is: local state for the dashboard UI, logfire for observability.

## 8. migration order

1. add dependency, configure in main.zig
2. add spans to request handlers (search, similar, tags, popular)
3. add structured logging for errors and important events
4. add metrics for key counters
5. gradually replace `std.debug.print` with logfire logging
6. consider removing timing.zig if logfire histograms are sufficient

## 9. example: full search handler

```zig
fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
    const span = logfire.span("http.search", .{});
    defer span.end();

    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const alloc = arena.allocator();

    const query = parseQueryParam(alloc, target, "q") catch "";
    const tag_filter = parseQueryParam(alloc, target, "tag") catch null;

    if (query.len == 0 and tag_filter == null) {
        logfire.debug("empty search request", .{});
        try sendJson(request, "{\"error\":\"enter a search term\"}");
        return;
    }

    const results = search.search(alloc, query, tag_filter, null, null) catch |err| {
        logfire.err("search failed: {s}", .{@errorName(err)});
        stats.recordError();
        return err;
    };

    logfire.counter("search.requests", 1);
    logfire.info("search completed", .{});

    // ... send response ...
}
```

## 10. verifying it works

run locally:

```bash
LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxx zig build run
```

check the logfire dashboard for traces from the `leaflet-search` service.

without a token (console fallback):

```bash
zig build run
# prints [span], [info], [metric] to stderr
```
+350
docs/scratch/standard-search-planning.md
···
# standard-search planning

expanding leaflet-search to index all standard.site records.

## references

- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons
- [ATProto docs](https://atproto.com/docs) - protocol documentation

## context

discussion with pckt.blog team about building global search for the standard.site ecosystem.
current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.

### recent work (2026-01-05)

added similarity cache to improve `/similar` endpoint performance (sketched below):
- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
- cache auto-invalidates when document count changes
- `/stats` endpoint now shows `cache_hits` and `cache_misses`
- first request ~3s (cold), cached requests ~0.15s
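
a hedged sketch of what such a cache table can look like — the actual column layout isn't shown in this doc, only the `(source_uri, doc_count)` key and the invalidation rule:

```sql
-- sketch only: cached /similar results keyed by source doc + corpus size
CREATE TABLE IF NOT EXISTS similarity_cache (
  source_uri TEXT NOT NULL,
  doc_count  INTEGER NOT NULL,   -- total documents when computed; a new count invalidates the row
  results    TEXT NOT NULL,      -- serialized result list (e.g. JSON)
  created_at TEXT DEFAULT (datetime('now')),
  PRIMARY KEY (source_uri, doc_count)
);

-- lookup: only hits while the document count still matches
SELECT results FROM similarity_cache
WHERE source_uri = :uri
  AND doc_count = (SELECT COUNT(*) FROM documents);
```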

also added loading indicator for "related to" results in frontend.

### recent work (2026-01-06)

- merged PR1: multi-platform schema (platform + source_collection columns)
- added `loading.js` - portable loading state handler for dashboards
  - skeleton shimmer while loading
  - "waking up" toast after 2s threshold (fly.io cold start handling)
  - designed to be copied to other projects
- fixed pluralization ("1 result" vs "2 results")

## what we know

### standard.site lexicons

two shared lexicons for long-form publishing on ATProto:
- `site.standard.document` - document content and metadata
- `site.standard.publication` - publication/blog metadata

implementing platforms:
- leaflet.pub (`pub.leaflet.*`)
- pckt.blog (`blog.pckt.*`)
- offprint.app (`app.offprint.*`)

### site.standard.document schema

examined real records from pckt.blog. key fields:

```
textContent  - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
content      - platform-specific block structure
  .$type     - identifies platform (e.g., "blog.pckt.content")
title        - document title
tags         - array of strings
site         - AT-URI reference to site.standard.publication
path         - URL path (e.g., "/my-post-abc123")
publishedAt  - ISO timestamp
updatedAt    - ISO timestamp
coverImage   - blob reference
```

### the textContent field

this is huge. platforms flatten their block content into a single text field:

```json
{
  "content": {
    "$type": "blog.pckt.content",
    "items": [ /* platform-specific blocks */ ]
  },
  "textContent": "i have been writing a lot of atproto things in zig!..."
}
```

no need to parse platform-specific blocks - just index `textContent` directly.

### platform detection

derive platform from `content.$type` prefix:
- `blog.pckt.content` → pckt
- `pub.leaflet.content` → leaflet (TBD - need to verify)
- `app.offprint.content` → offprint (TBD - need to verify)

### current leaflet-search architecture

```
ATProto firehose (via tap)
  ↓
tap.zig - subscribes to pub.leaflet.document/publication
  ↓
indexer.zig - extracts content from nested pages[].blocks[] structure
  ↓
turso (sqlite) - documents table + FTS5 + embeddings
  ↓
search.zig - FTS5 queries + vector similarity
  ↓
server.zig - HTTP API (/search, /similar, /stats)
```

leaflet-specific code:
- tap.zig lines 10-11: hardcoded collection names
- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
- recursive page/block traversal logic

generalizable code:
- database schema (FTS5, tags, stats, similarity cache)
- search/similar logic
- HTTP API
- embedding pipeline

## proposed architecture for standard-search

### ingestion changes

subscribe to:
- `site.standard.document`
- `site.standard.publication`

optionally also subscribe to platform-specific collections for richer data:
- `pub.leaflet.document/publication`
- `blog.pckt.document/publication` (if they have these)
- `app.offprint.document/publication` (if they have these)

### content extraction

for `site.standard.document`:
1. use `textContent` field directly - no block parsing!
2. fall back to title + description if textContent missing

for platform-specific records (if needed):
- keep existing leaflet block parser
- add parsers for other platforms as needed

### database changes

add to documents table (sketched below):
- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
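
a sketch of the corresponding migration, mirroring the `ALTER TABLE` pattern already used in `schema.zig` — the `standard_uri` column is still hypothetical at this point in the plan:

```sql
-- sketch only: additive columns for multi-platform support (PR1)
ALTER TABLE documents ADD COLUMN platform TEXT DEFAULT 'leaflet';
ALTER TABLE documents ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.document';
ALTER TABLE documents ADD COLUMN standard_uri TEXT;  -- link to the matching site.standard.document, if any

-- backfill the existing ~3500 rows
UPDATE documents SET platform = 'leaflet' WHERE platform IS NULL;
UPDATE documents SET source_collection = 'pub.leaflet.document' WHERE source_collection IS NULL;
```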
146146+147147+### API changes
148148+149149+- `/search?q=...&platform=leaflet` - optional platform filter
150150+- results include `platform` field
151151+- `/similar` works across all platforms
152152+153153+### naming/deployment
154154+155155+options:
156156+1. rename leaflet-search โ standard-search (breaking change)
157157+2. new repo/deployment, keep leaflet-search as-is
158158+3. branch and generalize, decide naming later
159159+160160+leaning toward option 3 for now.
161161+162162+## findings from exploration
163163+164164+### pckt.blog - READY
165165+- writes `site.standard.document` records
166166+- has `textContent` field (pre-flattened)
167167+- `content.$type` = `blog.pckt.content`
168168+- 6+ records found on pckt.blog service account
169169+170170+### leaflet.pub - NOT YET MIGRATED
171171+- still using `pub.leaflet.document` only
172172+- no `site.standard.document` records found
173173+- no `textContent` field - content is in nested `pages[].blocks[]`
174174+- will need to continue parsing blocks OR wait for migration
175175+176176+### offprint.app - NOW INDEXED (2026-01-22)
177177+- writes `site.standard.document` records with `app.offprint.content` blocks
178178+- has `textContent` field (pre-flattened)
179179+- platform detected via basePath (`*.offprint.app`, `*.offprint.test`)
180180+- now fully supported alongside leaflet and pckt
181181+182182+### greengale.app - NOW INDEXED (2026-01-22)
183183+- writes `site.standard.document` records
184184+- has `textContent` field (pre-flattened)
185185+- platform detected via basePath (`greengale.app/*`)
186186+- ~29 documents indexed at time of discovery
187187+- now fully supported alongside leaflet, pckt, and offprint
188188+189189+### implication for architecture
190190+191191+two paths:
192192+193193+**path A: wait for leaflet migration**
194194+- simpler: just index `site.standard.document` with `textContent`
195195+- all platforms converge on same schema
196196+- downside: loses existing leaflet search until they migrate
197197+198198+**path B: hybrid approach**
199199+- index `site.standard.document` (pckt, future leaflet, offprint)
200200+- ALSO index `pub.leaflet.document` with existing block parser
201201+- dedupe by URI or store both with `source_collection` indicator
202202+- more complex but maintains backwards compat
203203+204204+leaning toward **path B** - can't lose 3500 leaflet docs.
205205+206206+## open questions
207207+208208+- [x] does leaflet write site.standard.document records? **NO, not yet**
209209+- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
210210+- [ ] when will leaflet migrate to standard.site?
211211+- [ ] should we dedupe platform-specific vs standard records?
212212+- [ ] embeddings: regenerate for all, or use same model?
213213+214214+## implementation plan (PRs)
215215+216216+breaking work into reviewable chunks:
217217+218218+### PR1: database schema for multi-platform โ MERGED
219219+- add `platform TEXT` column to documents (default 'leaflet')
220220+- add `source_collection TEXT` column (default 'pub.leaflet.document')
221221+- backfill existing ~3500 records
222222+- no behavior change, just schema prep
223223+- https://github.com/zzstoatzz/leaflet-search/pull/1
224224+225225+### PR2: generalized content extraction
226226+- new `extractor.zig` module with platform-agnostic interface
227227+- `textContent` extraction for standard.site records
228228+- keep existing block parser for `pub.leaflet.*`
229229+- platform detection from `content.$type`
230230+231231+### PR3: tap subscriber for site.standard.document
232232+- subscribe to `site.standard.document` + `site.standard.publication`
233233+- route to appropriate extractor
234234+- starts ingesting pckt.blog content
235235+236236+### PR4: API platform filter
237237+- add `?platform=` query param to `/search`
238238+- include `platform` field in results
239239+- frontend: show platform badge, optional filter
240240+241241+### PR5 (optional, separate track): witness cache
242242+- `witness_cache` table for raw records
243243+- replay tooling for backfills
244244+- independent of above work
245245+246246+## operational notes
247247+248248+- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
249249+ ```bash
250250+ wrangler pages deploy site --project-name leaflet-search
251251+ ```
252252+- **fly.io backend**: deploy from backend directory:
253253+ ```bash
254254+ cd backend && fly deploy
255255+ ```
256256+- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
257257+258258+## next steps
259259+260260+1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
261261+2. ~~find and examine offprint records~~ (done - no public content yet)
262262+3. ~~PR1: database schema~~ (merged)
263263+4. PR2: generalized content extraction
264264+5. PR3: tap subscriber
265265+6. PR4: API platform filter
266266+7. consider witness cache architecture (see below)
267267+268268+---
269269+270270+## architectural consideration: witness cache
271271+272272+[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
273273+274274+> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
275275+>
276276+> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
277277+>
278278+> The key feature is: you can replay it
279279+280280+> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
281281+282282+### current leaflet-search architecture (no witness cache)
283283+284284+```
285285+Firehose โ tap โ Parse & Transform โ Store DERIVED data โ Discard raw record
286286+```
287287+288288+we store:
289289+- `uri`, `did`, `rkey`
290290+- `title` (extracted)
291291+- `content` (flattened from blocks)
292292+- `created_at`, `publication_uri`
293293+294294+we discard: the raw record JSON
295295+296296+### witness cache architecture
297297+298298+```
299299+Firehose โ Store RAW record + witness_time โ Derive indexes on demand (replayable)
300300+```
301301+302302+would store:
303303+- `uri`, `collection`, `rkey`
304304+- `raw_record` (full JSON blob)
305305+- `witness_time` (when we indexed it)
306306+307307+then derive FTS, embeddings, etc. from local data via replay.
308308+309309+### comparison
310310+311311+| scenario | current (no cache) | with witness cache |
312312+|----------|-------------------|-------------------|
313313+| add new parser (offprint) | re-crawl network | replay local |
314314+| leaflet adds textContent | wait for new records | replay & re-extract |
315315+| fix parsing bug | re-crawl affected | replay & re-derive |
316316+| change embedding model | re-fetch content | replay local |
317317+| add new index/table | backfill from network | replay locally |
318318+319319+### trade-offs
320320+321321+**storage cost:**
322322+- ~3500 docs × ~10KB avg = ~35MB (not huge)
323323+- turso free tier: 9GB, so plenty of room
324324+325325+**complexity:**
326326+- two-phase: store raw, then derive
327327+- vs current one-phase: derive immediately
328328+329329+**benefits for standard-search:**
330330+- could add offprint/pckt parsers and replay existing data
331331+- when leaflet migrates to standard.site, re-derive without network
332332+- embedding backfill becomes local-only (no voyage API for content fetch)
333333+334334+### implementation options
335335+336336+1. **add `raw_record TEXT` column to existing tables**
337337+ - simple, backwards compatible
338338+ - can migrate incrementally
339339+340340+2. **separate `witness_cache` table**
341341+ - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
342342+ - cleaner separation of concerns
343343+ - documents/publications tables become derived views
344344+345345+3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
346346+ - better compression for JSON blobs
347347+ - good for analytics queries
348348+ - adds operational complexity
349349+350350+for our scale, option 1 or 2 with turso is probably fine.
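to make option 2 concrete, here's a rough sketch of the table plus a replay query (table name, column names, and json paths are illustrative only — nothing here is implemented):

```sql
-- witness cache: raw records plus when we first saw them
CREATE TABLE IF NOT EXISTS witness_cache (
    uri          TEXT PRIMARY KEY,
    collection   TEXT NOT NULL,
    raw_record   TEXT NOT NULL,  -- full JSON blob as received from tap
    witness_time TEXT NOT NULL   -- when we indexed it
);

-- replay: re-derive searchable fields locally, e.g. for site.standard.document
-- records that carry a pre-flattened textContent field
INSERT OR REPLACE INTO documents (uri, title, content, source_collection)
SELECT uri,
       json_extract(raw_record, '$.title'),
       json_extract(raw_record, '$.textContent'),
       collection
FROM witness_cache
WHERE collection = 'site.standard.document';
```

with something like this, backfilling a new index or re-running a fixed parser becomes a local query instead of a network crawl.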
+124
docs/search-architecture.md
···11+# search architecture
22+33+current state, rationale, and future options.
44+55+## current: SQLite FTS5
66+77+we use SQLite's built-in full-text search (FTS5) via Turso.
88+99+### why FTS5 works for now
1010+1111+- **scale**: ~3500 documents. FTS5 handles this trivially.
1212+- **latency**: 10-50ms for search queries. fine for our use case.
1313+- **cost**: $0. included with Turso free tier.
1414+- **ops**: zero. no separate service to run.
1515+- **simplicity**: one database for everything (docs, FTS, vectors, cache).
1616+1717+### how it works
1818+1919+```
2020+user query: "crypto-casino"
2121+ ↓
2222+buildFtsQuery(): "crypto OR casino*"
2323+ ↓
2424+FTS5 MATCH query with BM25 + recency decay
2525+ ↓
2626+results with snippet()
2727+```
2828+2929+key decisions:
3030+- **OR between terms** for better recall (deliberate, see commit 35ad4b5)
3131+- **prefix match on last word** for type-ahead feel
3232+- **unicode61 tokenizer** splits on non-alphanumeric (we match this in buildFtsQuery)
3333+- **recency decay** boosts recent docs: `ORDER BY rank + (days_old / 30)`
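as a concrete illustration of how those decisions combine, here's a hand-written sketch of the kind of query search.zig issues (the join and column names are assumed for illustration; the real queries may differ):

```sql
-- sketch only: assumes documents_fts(title, content) mirrors documents by rowid
SELECT d.uri,
       d.title,
       snippet(documents_fts, 1, '<b>', '</b>', '…', 12) AS snip
FROM documents_fts
JOIN documents d ON d.rowid = documents_fts.rowid
WHERE documents_fts MATCH 'crypto OR casino*'  -- output of buildFtsQuery()
ORDER BY rank + ((julianday('now') - julianday(d.created_at)) / 30.0)  -- BM25 + recency decay
LIMIT 40
```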
3434+3535+### what's coupled to FTS5
3636+3737+all in `backend/src/search.zig`:
3838+3939+| component | FTS5-specific |
4040+|-----------|---------------|
4141+| 10 query definitions | `MATCH`, `snippet()`, `ORDER BY rank` |
4242+| `buildFtsQuery()` | constructs FTS5 syntax |
4343+| schema | `documents_fts`, `publications_fts` virtual tables |
4444+4545+### what's already decoupled
4646+4747+- result types (`SearchResultJson`, `Doc`, `Pub`)
4848+- similarity search (uses `vector_distance_cos`, not FTS5)
4949+- caching logic
5050+- HTTP layer (server.zig just calls `search()`)
5151+5252+### known limitations
5353+5454+- **no typo tolerance**: "leafet" won't find "leaflet"
5555+- **no relevance tuning**: can't boost title vs content
5656+- **single writer**: SQLite write lock
5757+- **no horizontal scaling**: single database
5858+5959+these aren't problems at current scale.
6060+6161+## future: if we need to scale
6262+6363+### when to consider switching
6464+6565+- search latency consistently >100ms
6666+- write contention from indexing
6767+- need typo tolerance or better relevance
6868+- millions of documents
6969+7070+### recommended: Elasticsearch
7171+7272+Elasticsearch is the battle-tested choice for production search:
7373+7474+- proven at massive scale (Wikipedia, GitHub, Stack Overflow)
7575+- rich query DSL, analyzers, aggregations
7676+- typo tolerance via fuzzy matching
7777+- horizontal scaling built-in
7878+- extensive tooling and community
7979+8080+trade-offs:
8181+- operational complexity (JVM, cluster management)
8282+- resource hungry (~2GB+ RAM minimum)
8383+- cost: $50-500/month depending on scale
8484+8585+### alternatives considered
8686+8787+**Meilisearch/Typesense**: simpler, lighter, great defaults. good for straightforward search but less proven at scale. would work fine for this use case but Elasticsearch has more headroom.
8888+8989+**Algolia**: fully managed, excellent but expensive. makes sense if you want zero ops.
9090+9191+**PostgreSQL full-text**: if already on Postgres. not as good as FTS5 or Elasticsearch but one less system.
9292+9393+### migration path
9494+9595+1. keep Turso as source of truth
9696+2. add Elasticsearch as search index
9797+3. sync documents to ES on write (async)
9898+4. point `/search` at Elasticsearch
9999+5. keep `/similar` on Turso (vector search)
100100+101101+the `search()` function would change from SQL queries to ES client calls. result types stay the same. HTTP layer unchanged.
102102+103103+estimated effort: 1-2 days to swap search backend.
104104+105105+### vector search scaling
106106+107107+similarity search currently uses brute-force `vector_distance_cos` with caching. at scale:
108108+109109+- **Elasticsearch**: has vector search (dense_vector + kNN)
110110+- **dedicated vector DB**: Qdrant, Pinecone, Weaviate
111111+- **pgvector**: if on Postgres
112112+113113+could consolidate text + vector in Elasticsearch, or keep them separate.
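for reference, the current brute-force approach corresponds to a query shaped roughly like this (assuming an `embedding` column on documents; the real query and caching live in search.zig):

```sql
-- sketch of the brute-force /similar lookup using vector_distance_cos
SELECT d.uri,
       d.title,
       vector_distance_cos(d.embedding, src.embedding) AS distance
FROM documents d,
     (SELECT embedding FROM documents WHERE uri = '<SOURCE_URI>') AS src
WHERE d.uri != '<SOURCE_URI>'
  AND d.embedding IS NOT NULL
ORDER BY distance ASC
LIMIT 10
```

this scans every embedding per request, which is why the similarity cache matters; any of the options above replaces the scan with a proper index.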
114114+115115+## summary
116116+117117+| scale | recommendation |
118118+|-------|----------------|
119119+| <10k docs | keep FTS5 (current) |
120120+| 10k-100k docs | still probably fine, monitor latency |
121121+| 100k+ docs | consider Elasticsearch |
122122+| millions + sub-ms latency | Elasticsearch cluster + caching layer |
123123+124124+we're in the "keep FTS5" zone. the code is structured to swap later if needed.
-343
docs/standard-search-planning.md
···11-# standard-search planning
22-33-expanding leaflet-search to index all standard.site records.
44-55-## references
66-77-- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
88-- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
99-- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
1010-- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons (early beta)
1111-- [ATProto docs](https://atproto.com/docs) - protocol documentation
1212-1313-## context
1414-1515-discussion with pckt.blog team about building global search for standard.site ecosystem.
1616-current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.
1717-1818-### recent work (2026-01-05)
1919-2020-added similarity cache to improve `/similar` endpoint performance:
2121-- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
2222-- cache auto-invalidates when document count changes
2323-- `/stats` endpoint now shows `cache_hits` and `cache_misses`
2424-- first request ~3s (cold), cached requests ~0.15s
2525-2626-also added loading indicator for "related to" results in frontend.
2727-2828-### recent work (2026-01-06)
2929-3030-- merged PR1: multi-platform schema (platform + source_collection columns)
3131-- added `loading.js` - portable loading state handler for dashboards
3232- - skeleton shimmer while loading
3333- - "waking up" toast after 2s threshold (fly.io cold start handling)
3434- - designed to be copied to other projects
3535-- fixed pluralization ("1 result" vs "2 results")
3636-3737-## what we know
3838-3939-### standard.site lexicons
4040-4141-two shared lexicons for long-form publishing on ATProto:
4242-- `site.standard.document` - document content and metadata
4343-- `site.standard.publication` - publication/blog metadata
4444-4545-implementing platforms:
4646-- leaflet.pub (`pub.leaflet.*`)
4747-- pckt.blog (`blog.pckt.*`)
4848-- offprint.app (`app.offprint.*`)
4949-5050-### site.standard.document schema
5151-5252-examined real records from pckt.blog. key fields:
5353-5454-```
5555-textContent - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
5656-content - platform-specific block structure
5757- .$type - identifies platform (e.g., "blog.pckt.content")
5858-title - document title
5959-tags - array of strings
6060-site - AT-URI reference to site.standard.publication
6161-path - URL path (e.g., "/my-post-abc123")
6262-publishedAt - ISO timestamp
6363-updatedAt - ISO timestamp
6464-coverImage - blob reference
6565-```
6666-6767-### the textContent field
6868-6969-this is huge. platforms flatten their block content into a single text field:
7070-7171-```json
7272-{
7373- "content": {
7474- "$type": "blog.pckt.content",
7575- "items": [ /* platform-specific blocks */ ]
7676- },
7777- "textContent": "i have been writing a lot of atproto things in zig!..."
7878-}
7979-```
8080-8181-no need to parse platform-specific blocks - just index `textContent` directly.
8282-8383-### platform detection
8484-8585-derive platform from `content.$type` prefix:
8686-- `blog.pckt.content` → pckt
8787-- `pub.leaflet.content` → leaflet (TBD - need to verify)
8888-- `app.offprint.content` → offprint (TBD - need to verify)
8989-9090-### current leaflet-search architecture
9191-9292-```
9393-ATProto firehose (via tap)
9494- ↓
9595-tap.zig - subscribes to pub.leaflet.document/publication
9696- ↓
9797-indexer.zig - extracts content from nested pages[].blocks[] structure
9898- ↓
9999-turso (sqlite) - documents table + FTS5 + embeddings
100100- ↓
101101-search.zig - FTS5 queries + vector similarity
102102- ↓
103103-server.zig - HTTP API (/search, /similar, /stats)
104104-```
105105-106106-leaflet-specific code:
107107-- tap.zig lines 10-11: hardcoded collection names
108108-- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
109109-- recursive page/block traversal logic
110110-111111-generalizable code:
112112-- database schema (FTS5, tags, stats, similarity cache)
113113-- search/similar logic
114114-- HTTP API
115115-- embedding pipeline
116116-117117-## proposed architecture for standard-search
118118-119119-### ingestion changes
120120-121121-subscribe to:
122122-- `site.standard.document`
123123-- `site.standard.publication`
124124-125125-optionally also subscribe to platform-specific collections for richer data:
126126-- `pub.leaflet.document/publication`
127127-- `blog.pckt.document/publication` (if they have these)
128128-- `app.offprint.document/publication` (if they have these)
129129-130130-### content extraction
131131-132132-for `site.standard.document`:
133133-1. use `textContent` field directly - no block parsing!
134134-2. fall back to title + description if textContent missing
135135-136136-for platform-specific records (if needed):
137137-- keep existing leaflet block parser
138138-- add parsers for other platforms as needed
139139-140140-### database changes
141141-142142-add to documents table:
143143-- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
144144-- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
145145-- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
146146-147147-### API changes
148148-149149-- `/search?q=...&platform=leaflet` - optional platform filter
150150-- results include `platform` field
151151-- `/similar` works across all platforms
152152-153153-### naming/deployment
154154-155155-options:
156156-1. rename leaflet-search → standard-search (breaking change)
157157-2. new repo/deployment, keep leaflet-search as-is
158158-3. branch and generalize, decide naming later
159159-160160-leaning toward option 3 for now.
161161-162162-## findings from exploration
163163-164164-### pckt.blog - READY
165165-- writes `site.standard.document` records
166166-- has `textContent` field (pre-flattened)
167167-- `content.$type` = `blog.pckt.content`
168168-- 6+ records found on pckt.blog service account
169169-170170-### leaflet.pub - NOT YET MIGRATED
171171-- still using `pub.leaflet.document` only
172172-- no `site.standard.document` records found
173173-- no `textContent` field - content is in nested `pages[].blocks[]`
174174-- will need to continue parsing blocks OR wait for migration
175175-176176-### offprint.app - LIKELY EARLY BETA
177177-- no `site.standard.document` records found on offprint.app account
178178-- no `app.offprint.document` collection visible
179179-- website shows no example users/content
180180-- probably in early/private beta - no public records yet
181181-182182-### implication for architecture
183183-184184-two paths:
185185-186186-**path A: wait for leaflet migration**
187187-- simpler: just index `site.standard.document` with `textContent`
188188-- all platforms converge on same schema
189189-- downside: loses existing leaflet search until they migrate
190190-191191-**path B: hybrid approach**
192192-- index `site.standard.document` (pckt, future leaflet, offprint)
193193-- ALSO index `pub.leaflet.document` with existing block parser
194194-- dedupe by URI or store both with `source_collection` indicator
195195-- more complex but maintains backwards compat
196196-197197-leaning toward **path B** - can't lose 3500 leaflet docs.
198198-199199-## open questions
200200-201201-- [x] does leaflet write site.standard.document records? **NO, not yet**
202202-- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
203203-- [ ] when will leaflet migrate to standard.site?
204204-- [ ] should we dedupe platform-specific vs standard records?
205205-- [ ] embeddings: regenerate for all, or use same model?
206206-207207-## implementation plan (PRs)
208208-209209-breaking work into reviewable chunks:
210210-211211-### PR1: database schema for multi-platform ✅ MERGED
212212-- add `platform TEXT` column to documents (default 'leaflet')
213213-- add `source_collection TEXT` column (default 'pub.leaflet.document')
214214-- backfill existing ~3500 records
215215-- no behavior change, just schema prep
216216-- https://github.com/zzstoatzz/leaflet-search/pull/1
217217-218218-### PR2: generalized content extraction
219219-- new `extractor.zig` module with platform-agnostic interface
220220-- `textContent` extraction for standard.site records
221221-- keep existing block parser for `pub.leaflet.*`
222222-- platform detection from `content.$type`
223223-224224-### PR3: TAP subscriber for site.standard.document
225225-- subscribe to `site.standard.document` + `site.standard.publication`
226226-- route to appropriate extractor
227227-- starts ingesting pckt.blog content
228228-229229-### PR4: API platform filter
230230-- add `?platform=` query param to `/search`
231231-- include `platform` field in results
232232-- frontend: show platform badge, optional filter
233233-234234-### PR5 (optional, separate track): witness cache
235235-- `witness_cache` table for raw records
236236-- replay tooling for backfills
237237-- independent of above work
238238-239239-## operational notes
240240-241241-- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
242242- ```bash
243243- wrangler pages deploy site --project-name leaflet-search
244244- ```
245245-- **fly.io backend**: deploy from backend directory:
246246- ```bash
247247- cd backend && fly deploy
248248- ```
249249-- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
250250-251251-## next steps
252252-253253-1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
254254-2. ~~find and examine offprint records~~ (done - no public content yet)
255255-3. ~~PR1: database schema~~ (merged)
256256-4. PR2: generalized content extraction
257257-5. PR3: TAP subscriber
258258-6. PR4: API platform filter
259259-7. consider witness cache architecture (see below)
260260-261261----
262262-263263-## architectural consideration: witness cache
264264-265265-[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
266266-267267-> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
268268->
269269-> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
270270->
271271-> The key feature is: you can replay it
272272-273273-> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
274274-275275-### current leaflet-search architecture (no witness cache)
276276-277277-```
278278-Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record
279279-```
280280-281281-we store:
282282-- `uri`, `did`, `rkey`
283283-- `title` (extracted)
284284-- `content` (flattened from blocks)
285285-- `created_at`, `publication_uri`
286286-287287-we discard: the raw record JSON
288288-289289-### witness cache architecture
290290-291291-```
292292-Firehose → Store RAW record + witness_time → Derive indexes on demand (replayable)
293293-```
294294-295295-would store:
296296-- `uri`, `collection`, `rkey`
297297-- `raw_record` (full JSON blob)
298298-- `witness_time` (when we indexed it)
299299-300300-then derive FTS, embeddings, etc. from local data via replay.
301301-302302-### comparison
303303-304304-| scenario | current (no cache) | with witness cache |
305305-|----------|-------------------|-------------------|
306306-| add new parser (offprint) | re-crawl network | replay local |
307307-| leaflet adds textContent | wait for new records | replay & re-extract |
308308-| fix parsing bug | re-crawl affected | replay & re-derive |
309309-| change embedding model | re-fetch content | replay local |
310310-| add new index/table | backfill from network | replay locally |
311311-312312-### trade-offs
313313-314314-**storage cost:**
315315-- ~3500 docs × ~10KB avg = ~35MB (not huge)
316316-- turso free tier: 9GB, so plenty of room
317317-318318-**complexity:**
319319-- two-phase: store raw, then derive
320320-- vs current one-phase: derive immediately
321321-322322-**benefits for standard-search:**
323323-- could add offprint/pckt parsers and replay existing data
324324-- when leaflet migrates to standard.site, re-derive without network
325325-- embedding backfill becomes local-only (no voyage API for content fetch)
326326-327327-### implementation options
328328-329329-1. **add `raw_record TEXT` column to existing tables**
330330- - simple, backwards compatible
331331- - can migrate incrementally
332332-333333-2. **separate `witness_cache` table**
334334- - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
335335- - cleaner separation of concerns
336336- - documents/publications tables become derived views
337337-338338-3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
339339- - better compression for JSON blobs
340340- - good for analytics queries
341341- - adds operational complexity
342342-343343-for our scale, option 1 or 2 with turso is probably fine.
+215
docs/tap.md
···11+# tap (firehose sync)
22+33+leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
44+55+## what is tap?
66+77+tap subscribes to the ATProto firehose, filters for specific collections (e.g., `site.standard.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.
88+99+key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking:
1010+1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
1111+2. live firehose events during backfill are buffered in memory
1212+3. historical events (marked `live: false`) are delivered first
1313+4. after historical events complete, buffered live events are released
1414+5. subsequent firehose events arrive immediately marked as `live: true`
1515+1616+tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
1717+1818+## message format
1919+2020+tap sends JSON messages over websocket. record events look like:
2121+2222+```json
2323+{
2424+ "type": "record",
2525+ "record": {
2626+ "live": true,
2727+ "did": "did:plc:abc123...",
2828+ "rev": "3mbspmpaidl2a",
2929+ "collection": "site.standard.document",
3030+ "rkey": "3lzyrj6q6gs27",
3131+ "action": "create",
3232+ "record": { ... },
3333+ "cid": "bafyrei..."
3434+ }
3535+}
3636+```
3737+3838+### field types (important!)
3939+4040+| field | type | values | notes |
4141+|-------|------|--------|-------|
4242+| type | string | "record", "identity", "account" | message type |
4343+| action | **string** | "create", "update", "delete" | NOT an enum! |
4444+| live | bool | true/false | true = firehose, false = resync |
4545+| collection | string | e.g., "site.standard.document" | lexicon collection |
4646+4747+## gotchas
4848+4949+1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
5050+5151+2. **collection filters apply during processing** - `TAP_COLLECTION_FILTERS` controls which records tap processes and sends to clients, both during live commits and resync CAR walks. records from other collections are skipped entirely.
5252+5353+3. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.
5454+5555+4. **silent extraction failures** - if using zat's `extractAt`, enable debug logging to see why parsing fails:
5656+ ```zig
5757+ pub const std_options = .{
5858+ .log_scope_levels = &.{.{ .scope = .zat, .level = .debug }},
5959+ };
6060+ ```
6161+ this will show messages like:
6262+ ```
6363+ debug(zat): extractAt: parse failed for Op at path { "op" }: InvalidEnumTag
6464+ ```
6565+6666+## memory and performance tuning
6767+6868+tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
6969+7070+### recommended settings for leaflet-search
7171+7272+```toml
7373+[[vm]]
7474+ memory = '2gb' # 1gb is not enough
7575+7676+[env]
7777+ TAP_RESYNC_PARALLELISM = '1' # only one repo CAR in memory at a time (default: 5)
7878+ TAP_FIREHOSE_PARALLELISM = '5' # concurrent event processors (default: 10)
7979+ TAP_OUTBOX_CAPACITY = '10000' # event buffer size (default: 100000)
8080+ TAP_IDENT_CACHE_SIZE = '10000' # identity cache entries (default: 2000000)
8181+```
8282+8383+### why these values?
8484+8585+- **2GB memory**: 1GB causes OOM kills when resyncing large repos
8686+- **resync parallelism 1**: prevents multiple large CARs in memory simultaneously
8787+- **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
8888+- **smaller ident cache**: we don't need 2M cached identities
8989+9090+if tap keeps OOM'ing, check logs for large repo resyncs:
9191+```bash
9292+fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
9393+```
9494+9595+## quick status check
9696+9797+from the `tap/` directory:
9898+```bash
9999+just check
100100+```
101101+102102+shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
103103+104104+example output:
105105+```
106106+=== tap status ===
107107+app 781417db604d48 23 ewr started ...
108108+109109+=== Recent Indexing Activity ===
110110+Last indexed: 2026-01-08 (14 docs)
111111+Today: 2026-01-11
112112+Docs: 3742 | Pubs: 1231
113113+114114+=== Timeline (last 7 days) ===
115115+2026-01-08: 14 docs
116116+2026-01-07: 29 docs
117117+...
118118+```
119119+120120+if "Last indexed" is more than a day behind "Today", tap may be down or catching up.
121121+122122+## checking catch-up progress
123123+124124+when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress:
125125+126126+```bash
127127+# see current firehose position (look for timestamps in log messages)
128128+fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
129129+```
130130+131131+the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up.
132132+133133+catch-up speed varies:
134134+- **~0.3x** when resync queue is full (large repos being fetched)
135135+- **~1x or faster** once resyncs clear
136136+137137+## debugging
138138+139139+### check tap connection
140140+```bash
141141+fly logs -a leaflet-search-tap --no-tail | tail -30
142142+```
143143+144144+look for:
145145+- `"connected to firehose"` - successfully connected to bsky relay
146146+- `"websocket connected"` - backend connected to tap
147147+- `"dialing failed"` / `"i/o timeout"` - network issues
148148+149149+### check backend is receiving
150150+```bash
151151+fly logs -a leaflet-search-backend --no-tail | grep -E "(tap|indexed)"
152152+```
153153+154154+look for:
155155+- `tap connected!` - connected to tap
156156+- `tap: msg_type=record` - receiving messages
157157+- `indexed document:` - successfully processing
158158+159159+### common issues
160160+161161+| symptom | cause | fix |
162162+|---------|-------|-----|
163163+| tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164164+| `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match |
165165+| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue |
166166+| messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
167167+| repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168168+| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap |
169169+| indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up |
170170+171171+## tap API endpoints
172172+173173+tap exposes HTTP endpoints for monitoring and control:
174174+175175+| endpoint | description |
176176+|----------|-------------|
177177+| `/health` | health check |
178178+| `/stats/repo-count` | number of tracked repos |
179179+| `/stats/record-count` | total records processed |
180180+| `/stats/outbox-buffer` | events waiting to be sent |
181181+| `/stats/resync-buffer` | buffered commits for repos currently resyncing (NOT the resync queue) |
182182+| `/stats/cursors` | firehose cursor position |
183183+| `/info/:did` | repo status: `{"did":"...","state":"active","records":N}` |
184184+| `/repos/add` | POST with `{"dids":["did:plc:..."]}` to add repos |
185185+| `/repos/remove` | POST with `{"dids":["did:plc:..."]}` to remove repos |
186186+187187+example: check repo status
188188+```bash
189189+fly ssh console -a leaflet-search-tap -C "curl -s localhost:2480/info/did:plc:abc123"
190190+```
191191+192192+example: manually add a repo for backfill
193193+```bash
194194+fly ssh console -a leaflet-search-tap -C 'curl -X POST -H "Content-Type: application/json" -d "{\"dids\":[\"did:plc:abc123\"]}" localhost:2480/repos/add'
195195+```
196196+197197+## fly.io deployment
198198+199199+both tap and backend should be in the same region for internal networking:
200200+201201+```bash
202202+# check current regions
203203+fly status -a leaflet-search-tap
204204+fly status -a leaflet-search-backend
205205+206206+# restart tap if needed
207207+fly machine restart -a leaflet-search-tap <machine-id>
208208+```
209209+210210+note: changing `primary_region` in fly.toml only affects new machines. to move existing machines, clone to new region and destroy old one.
211211+212212+## references
213213+214214+- [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
215215+- [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
+5-5
mcp/README.md
···11-# leaflet-mcp
11+# pub search MCP
2233-MCP server for [Leaflet](https://leaflet.pub) - search decentralized publications on ATProto.
33+MCP server for [pub search](https://pub-search.waow.tech) - search ATProto publishing platforms (Leaflet, pckt, standard.site).
4455## usage
6677### hosted (recommended)
8899```bash
1010-claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
1010+claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
1111```
12121313### local
···1515run the MCP server locally with `uvx`:
16161717```bash
1818-uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp leaflet-mcp
1818+uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp pub-search
1919```
20202121to add it to claude code as a local stdio server:
22222323```bash
2424-claude mcp add leaflet -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' leaflet-mcp
2424+claude mcp add pub-search -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' pub-search
2525```
26262727## workflow
···11+#!/usr/bin/env python3
22+"""Test the pub-search MCP server."""
33+44+import asyncio
55+import sys
66+77+from fastmcp import Client
88+from fastmcp.client.transports import FastMCPTransport
99+1010+from pub_search.server import mcp
1111+1212+1313+async def main():
1414+ # use local transport for testing, or live URL if --live flag
1515+ if "--live" in sys.argv:
1616+ print("testing against live Horizon server...")
1717+ client = Client("https://pub-search-by-zzstoatzz.fastmcp.app/mcp")
1818+ else:
1919+ print("testing locally with FastMCPTransport...")
2020+ client = Client(transport=FastMCPTransport(mcp))
2121+2222+ async with client:
2323+ # list tools
2424+ print("=== tools ===")
2525+ tools = await client.list_tools()
2626+ for t in tools:
2727+ print(f" {t.name}")
2828+2929+ # test search with new platform filter
3030+ print("\n=== search(query='zig', platform='leaflet', limit=3) ===")
3131+ result = await client.call_tool(
3232+ "search", {"query": "zig", "platform": "leaflet", "limit": 3}
3333+ )
3434+ for item in result.content:
3535+ print(f" {item.text[:200]}...")
3636+3737+ # test search with since filter
3838+ print("\n=== search(query='python', since='2025-01-01', limit=2) ===")
3939+ result = await client.call_tool(
4040+ "search", {"query": "python", "since": "2025-01-01", "limit": 2}
4141+ )
4242+ for item in result.content:
4343+ print(f" {item.text[:200]}...")
4444+4545+ # test get_tags
4646+ print("\n=== get_tags() ===")
4747+ result = await client.call_tool("get_tags", {})
4848+ for item in result.content:
4949+ print(f" {item.text[:150]}...")
5050+5151+ # test get_stats
5252+ print("\n=== get_stats() ===")
5353+ result = await client.call_tool("get_stats", {})
5454+ for item in result.content:
5555+ print(f" {item.text}")
5656+5757+ # test get_popular
5858+ print("\n=== get_popular(limit=3) ===")
5959+ result = await client.call_tool("get_popular", {"limit": 3})
6060+ for item in result.content:
6161+ print(f" {item.text[:100]}...")
6262+6363+ print("\n=== all tests passed ===")
6464+6565+6666+if __name__ == "__main__":
6767+ asyncio.run(main())
-5
mcp/src/leaflet_mcp/__init__.py
···11-"""Leaflet MCP server - search decentralized publications on ATProto."""
22-33-from leaflet_mcp.server import main, mcp
44-55-__all__ = ["main", "mcp"]
-58
mcp/src/leaflet_mcp/_types.py
···11-"""Type definitions for Leaflet MCP responses."""
22-33-from typing import Literal
44-55-from pydantic import BaseModel, computed_field
66-77-88-class SearchResult(BaseModel):
99- """A search result from the Leaflet API."""
1010-1111- type: Literal["article", "looseleaf", "publication"]
1212- uri: str
1313- did: str
1414- title: str
1515- snippet: str
1616- createdAt: str = ""
1717- rkey: str
1818- basePath: str = ""
1919-2020- @computed_field
2121- @property
2222- def url(self) -> str:
2323- """web URL for this document."""
2424- if self.basePath:
2525- return f"https://{self.basePath}/{self.rkey}"
2626- return ""
2727-2828-2929-class Tag(BaseModel):
3030- """A tag with document count."""
3131-3232- tag: str
3333- count: int
3434-3535-3636-class PopularSearch(BaseModel):
3737- """A popular search query with count."""
3838-3939- query: str
4040- count: int
4141-4242-4343-class Stats(BaseModel):
4444- """Leaflet index statistics."""
4545-4646- documents: int
4747- publications: int
4848-4949-5050-class Document(BaseModel):
5151- """Full document content from ATProto."""
5252-5353- uri: str
5454- title: str
5555- content: str
5656- createdAt: str = ""
5757- tags: list[str] = []
5858- publicationUri: str = ""
-21
mcp/src/leaflet_mcp/client.py
···11-"""HTTP client for Leaflet search API."""
22-33-import os
44-from contextlib import asynccontextmanager
55-from typing import AsyncIterator
66-77-import httpx
88-99-# configurable via env var, defaults to production
1010-LEAFLET_API_URL = os.getenv("LEAFLET_API_URL", "https://leaflet-search-backend.fly.dev")
1111-1212-1313-@asynccontextmanager
1414-async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515- """Get an async HTTP client for Leaflet API requests."""
1616- async with httpx.AsyncClient(
1717- base_url=LEAFLET_API_URL,
1818- timeout=30.0,
1919- headers={"Accept": "application/json"},
2020- ) as client:
2121- yield client
-289
mcp/src/leaflet_mcp/server.py
···11-"""Leaflet MCP server implementation using fastmcp."""
22-33-from __future__ import annotations
44-55-from typing import Any
66-77-from fastmcp import FastMCP
88-99-from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from leaflet_mcp.client import get_http_client
1111-1212-mcp = FastMCP("leaflet")
1313-1414-1515-# -----------------------------------------------------------------------------
1616-# prompts
1717-# -----------------------------------------------------------------------------
1818-1919-2020-@mcp.prompt("usage_guide")
2121-def usage_guide() -> str:
2222- """instructions for using leaflet MCP tools."""
2323- return """\
2424-# Leaflet MCP server usage guide
2525-2626-Leaflet is a decentralized publishing platform on ATProto (the protocol behind Bluesky).
2727-This MCP server provides search and discovery tools for Leaflet publications.
2828-2929-## core tools
3030-3131-- `search(query, tag)` - search documents and publications by text or tag
3232-- `get_document(uri)` - get the full content of a document by its AT-URI
3333-- `find_similar(uri)` - find documents similar to a given document
3434-- `get_tags()` - list all available tags with document counts
3535-- `get_stats()` - get index statistics (document/publication counts)
3636-- `get_popular()` - see popular search queries
3737-3838-## workflow for research
3939-4040-1. use `search("your topic")` to find relevant documents
4141-2. use `get_document(uri)` to retrieve full content of interesting results
4242-3. use `find_similar(uri)` to discover related content
4343-4444-## result types
4545-4646-search returns three types of results:
4747-- **publication**: a collection of articles (like a blog or magazine)
4848-- **article**: a document that belongs to a publication
4949-- **looseleaf**: a standalone document not part of a publication
5050-5151-## AT-URIs
5252-5353-documents are identified by AT-URIs like:
5454- `at://did:plc:abc123/pub.leaflet.document/xyz789`
5555-5656-you can also browse documents on the web at leaflet.pub
5757-"""
5858-5959-6060-@mcp.prompt("search_tips")
6161-def search_tips() -> str:
6262- """tips for effective searching."""
6363- return """\
6464-# Leaflet search tips
6565-6666-## text search
6767-- searches both document titles and content
6868-- uses FTS5 full-text search with prefix matching
6969-- the last word gets prefix matching: "cat dog" matches "cat dogs"
7070-7171-## tag filtering
7272-- combine text search with tag filter: `search("python", tag="programming")`
7373-- use `get_tags()` to discover available tags
7474-- tags are only applied to documents, not publications
7575-7676-## finding related content
7777-- after finding an interesting document, use `find_similar(uri)`
7878-- similarity is based on semantic embeddings (voyage-3-lite)
7979-- great for exploring related topics
8080-8181-## browsing by popularity
8282-- use `get_popular()` to see what others are searching for
8383-- can inspire new research directions
8484-"""
8585-8686-8787-# -----------------------------------------------------------------------------
8888-# tools
8989-# -----------------------------------------------------------------------------
9090-9191-9292-@mcp.tool
9393-async def search(
9494- query: str = "",
9595- tag: str | None = None,
9696- limit: int = 5,
9797-) -> list[SearchResult]:
9898- """search leaflet documents and publications.
9999-100100- searches the full text of documents (titles and content) and publications.
101101- results include a snippet showing where the match was found.
102102-103103- args:
104104- query: search query (searches titles and content)
105105- tag: optional tag to filter by (only applies to documents)
106106- limit: max results to return (default 5, max 40)
107107-108108- returns:
109109- list of search results with uri, title, snippet, and metadata
110110- """
111111- if not query and not tag:
112112- return []
113113-114114- params: dict[str, Any] = {}
115115- if query:
116116- params["q"] = query
117117- if tag:
118118- params["tag"] = tag
119119-120120- async with get_http_client() as client:
121121- response = await client.get("/search", params=params)
122122- response.raise_for_status()
123123- results = response.json()
124124-125125- # apply client-side limit since API returns up to 40
126126- return [SearchResult(**r) for r in results[:limit]]
127127-128128-129129-@mcp.tool
130130-async def get_document(uri: str) -> Document:
131131- """get the full content of a document by its AT-URI.
132132-133133- fetches the complete document from ATProto, including full text content.
134134- use this after finding documents via search to get the complete text.
135135-136136- args:
137137- uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
138138-139139- returns:
140140- document with full content, title, tags, and metadata
141141- """
142142- # use pdsx to fetch the actual record from ATProto
143143- try:
144144- from pdsx._internal.operations import get_record
145145- from pdsx.mcp.client import get_atproto_client
146146- except ImportError as e:
147147- raise RuntimeError(
148148- "pdsx is required for fetching full documents. install with: uv add pdsx"
149149- ) from e
150150-151151- # extract repo from URI for PDS discovery
152152- # at://did:plc:xxx/collection/rkey
153153- parts = uri.replace("at://", "").split("/")
154154- if len(parts) < 3:
155155- raise ValueError(f"invalid AT-URI: {uri}")
156156-157157- repo = parts[0]
158158-159159- async with get_atproto_client(target_repo=repo) as client:
160160- record = await get_record(client, uri)
161161-162162- value = record.value
163163- # DotDict doesn't have a working .get(), convert to dict first
164164- if hasattr(value, "to_dict") and callable(value.to_dict):
165165- value = value.to_dict()
166166- elif not isinstance(value, dict):
167167- value = dict(value)
168168-169169- # extract content from leaflet's block structure
170170- # pages[].blocks[].block.plaintext
171171- content_parts = []
172172- for page in value.get("pages", []):
173173- for block_wrapper in page.get("blocks", []):
174174- block = block_wrapper.get("block", {})
175175- plaintext = block.get("plaintext", "")
176176- if plaintext:
177177- content_parts.append(plaintext)
178178-179179- content = "\n\n".join(content_parts)
180180-181181- return Document(
182182- uri=record.uri,
183183- title=value.get("title", ""),
184184- content=content,
185185- createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
186186- tags=value.get("tags", []),
187187- publicationUri=value.get("publication", ""),
188188- )
189189-190190-191191-@mcp.tool
192192-async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
193193- """find documents similar to a given document.
194194-195195- uses vector similarity (voyage-3-lite embeddings) to find semantically
196196- related documents. great for discovering related content after finding
197197- an interesting document.
198198-199199- args:
200200- uri: the AT-URI of the document to find similar content for
201201- limit: max similar documents to return (default 5)
202202-203203- returns:
204204- list of similar documents with uri, title, and metadata
205205- """
206206- async with get_http_client() as client:
207207- response = await client.get("/similar", params={"uri": uri})
208208- response.raise_for_status()
209209- results = response.json()
210210-211211- return [SearchResult(**r) for r in results[:limit]]
212212-213213-214214-@mcp.tool
215215-async def get_tags() -> list[Tag]:
216216- """list all available tags with document counts.
217217-218218- returns tags sorted by document count (most popular first).
219219- useful for discovering topics and filtering searches.
220220-221221- returns:
222222- list of tags with their document counts
223223- """
224224- async with get_http_client() as client:
225225- response = await client.get("/tags")
226226- response.raise_for_status()
227227- results = response.json()
228228-229229- return [Tag(**t) for t in results]
230230-231231-232232-@mcp.tool
233233-async def get_stats() -> Stats:
234234- """get leaflet index statistics.
235235-236236- returns:
237237- document and publication counts
238238- """
239239- async with get_http_client() as client:
240240- response = await client.get("/stats")
241241- response.raise_for_status()
242242- return Stats(**response.json())
243243-244244-245245-@mcp.tool
246246-async def get_popular(limit: int = 5) -> list[PopularSearch]:
247247- """get popular search queries.
248248-249249- see what others are searching for on leaflet.
250250- can inspire new research directions.
251251-252252- args:
253253- limit: max queries to return (default 5)
254254-255255- returns:
256256- list of popular queries with search counts
257257- """
258258- async with get_http_client() as client:
259259- response = await client.get("/popular")
260260- response.raise_for_status()
261261- results = response.json()
262262-263263- return [PopularSearch(**p) for p in results[:limit]]
264264-265265-266266-# -----------------------------------------------------------------------------
267267-# resources
268268-# -----------------------------------------------------------------------------
269269-270270-271271-@mcp.resource("leaflet://stats")
272272-async def stats_resource() -> str:
273273- """current leaflet index statistics."""
274274- stats = await get_stats()
275275- return f"Leaflet index: {stats.documents} documents, {stats.publications} publications"
276276-277277-278278-# -----------------------------------------------------------------------------
279279-# entrypoint
280280-# -----------------------------------------------------------------------------
281281-282282-283283-def main() -> None:
284284- """run the MCP server."""
285285- mcp.run()
286286-287287-288288-if __name__ == "__main__":
289289- main()
+5
mcp/src/pub_search/__init__.py
···11+"""MCP server for searching ATProto publishing platforms."""
22+33+from pub_search.server import main, mcp
44+55+__all__ = ["main", "mcp"]
+59
mcp/src/pub_search/_types.py
···11+"""Type definitions for Leaflet MCP responses."""
22+33+from typing import Literal
44+55+from pydantic import BaseModel, computed_field
66+77+88+class SearchResult(BaseModel):
99+ """A search result from the Leaflet API."""
1010+1111+ type: Literal["article", "looseleaf", "publication"]
1212+ uri: str
1313+ did: str
1414+ title: str
1515+ snippet: str
1616+ createdAt: str = ""
1717+ rkey: str
1818+ basePath: str = ""
1919+ platform: Literal["leaflet", "pckt", "offprint", "greengale", "other"] = "leaflet"
2020+2121+ @computed_field
2222+ @property
2323+ def url(self) -> str:
2424+ """web URL for this document."""
2525+ if self.basePath:
2626+ return f"https://{self.basePath}/{self.rkey}"
2727+ return ""
2828+2929+3030+class Tag(BaseModel):
3131+ """A tag with document count."""
3232+3333+ tag: str
3434+ count: int
3535+3636+3737+class PopularSearch(BaseModel):
3838+ """A popular search query with count."""
3939+4040+ query: str
4141+ count: int
4242+4343+4444+class Stats(BaseModel):
4545+ """Leaflet index statistics."""
4646+4747+ documents: int
4848+ publications: int
4949+5050+5151+class Document(BaseModel):
5252+ """Full document content from ATProto."""
5353+5454+ uri: str
5555+ title: str
5656+ content: str
5757+ createdAt: str = ""
5858+ tags: list[str] = []
5959+ publicationUri: str = ""
+21
mcp/src/pub_search/client.py
···11+"""HTTP client for leaflet-search API."""
22+33+import os
44+from contextlib import asynccontextmanager
55+from typing import AsyncIterator
66+77+import httpx
88+99+# configurable via env var, defaults to production
1010+API_URL = os.getenv("LEAFLET_SEARCH_API_URL", "https://leaflet-search-backend.fly.dev")
1111+1212+1313+@asynccontextmanager
1414+async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515+ """Get an async HTTP client for API requests."""
1616+ async with httpx.AsyncClient(
1717+ base_url=API_URL,
1818+ timeout=30.0,
1919+ headers={"Accept": "application/json"},
2020+ ) as client:
2121+ yield client
+276
mcp/src/pub_search/server.py
···11+"""MCP server for searching ATProto publishing platforms."""
22+33+from __future__ import annotations
44+55+from typing import Any, Literal
66+77+from fastmcp import FastMCP
88+99+from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from pub_search.client import get_http_client
1111+1212+mcp = FastMCP("pub-search")
1313+1414+1515+# -----------------------------------------------------------------------------
1616+# prompts
1717+# -----------------------------------------------------------------------------
1818+1919+2020+@mcp.prompt("usage_guide")
2121+def usage_guide() -> str:
2222+ """instructions for using pub-search MCP tools."""
2323+ return """\
2424+# pub-search MCP
2525+2626+search ATProto publishing platforms: leaflet, pckt, offprint, greengale.
2727+2828+## tools
2929+3030+- `search(query, tag, platform, since)` - full-text search with filters
3131+- `get_document(uri)` - fetch full content by AT-URI
3232+- `find_similar(uri)` - semantic similarity search
3333+- `get_tags()` - available tags
3434+- `get_stats()` - index statistics
3535+- `get_popular()` - popular queries
3636+3737+## workflow
3838+3939+1. `search("topic")` or `search("topic", platform="leaflet")`
4040+2. `get_document(uri)` for full text
4141+3. `find_similar(uri)` for related content
4242+4343+## result types
4444+4545+- **article**: document in a publication
4646+- **looseleaf**: standalone document
4747+- **publication**: the publication itself
4848+4949+results include a `url` field for web access.
5050+"""
5151+5252+5353+@mcp.prompt("search_tips")
5454+def search_tips() -> str:
5555+ """tips for effective searching."""
5656+ return """\
5757+# search tips
5858+5959+- prefix matching on last word: "cat dog" matches "cat dogs"
6060+- combine filters: `search("python", tag="tutorial", platform="leaflet")`
6161+- use `since="2025-01-01"` for recent content
6262+- `find_similar(uri)` for semantic similarity (voyage-3-lite embeddings)
6363+- `get_tags()` to discover available tags
6464+"""
6565+6666+6767+# -----------------------------------------------------------------------------
6868+# tools
6969+# -----------------------------------------------------------------------------
7070+7171+7272+Platform = Literal["leaflet", "pckt", "offprint", "greengale", "other"]
7373+7474+7575+@mcp.tool
7676+async def search(
7777+ query: str = "",
7878+ tag: str | None = None,
7979+ platform: Platform | None = None,
8080+ since: str | None = None,
8181+ limit: int = 5,
8282+) -> list[SearchResult]:
8383+ """search documents and publications.
8484+8585+ args:
8686+ query: search query (titles and content)
8787+ tag: filter by tag
8888+ platform: filter by platform (leaflet, pckt, offprint, greengale, other)
8989+ since: ISO date - only documents created after this date
9090+ limit: max results (default 5, max 40)
9191+9292+ returns:
9393+ list of results with uri, title, snippet, platform, and web url
9494+ """
9595+ if not query and not tag:
9696+ return []
9797+9898+ params: dict[str, Any] = {}
9999+ if query:
100100+ params["q"] = query
101101+ if tag:
102102+ params["tag"] = tag
103103+ if platform:
104104+ params["platform"] = platform
105105+ if since:
106106+ params["since"] = since
107107+108108+ async with get_http_client() as client:
109109+ response = await client.get("/search", params=params)
110110+ response.raise_for_status()
111111+ results = response.json()
112112+113113+ return [SearchResult(**r) for r in results[:limit]]
114114+115115+116116+@mcp.tool
117117+async def get_document(uri: str) -> Document:
118118+ """get the full content of a document by its AT-URI.
119119+120120+ fetches the complete document from ATProto, including full text content.
121121+ use this after finding documents via search to get the complete text.
122122+123123+ args:
124124+ uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
125125+126126+ returns:
127127+ document with full content, title, tags, and metadata
128128+ """
129129+ # use pdsx to fetch the actual record from ATProto
130130+ try:
131131+ from pdsx._internal.operations import get_record
132132+ from pdsx.mcp.client import get_atproto_client
133133+ except ImportError as e:
134134+ raise RuntimeError(
135135+ "pdsx is required for fetching full documents. install with: uv add pdsx"
136136+ ) from e
137137+138138+ # extract repo from URI for PDS discovery
139139+ # at://did:plc:xxx/collection/rkey
140140+ parts = uri.replace("at://", "").split("/")
141141+ if len(parts) < 3:
142142+ raise ValueError(f"invalid AT-URI: {uri}")
143143+144144+ repo = parts[0]
145145+146146+ async with get_atproto_client(target_repo=repo) as client:
147147+ record = await get_record(client, uri)
148148+149149+ value = record.value
150150+ # DotDict doesn't have a working .get(), convert to dict first
151151+ if hasattr(value, "to_dict") and callable(value.to_dict):
152152+ value = value.to_dict()
153153+ elif not isinstance(value, dict):
154154+ value = dict(value)
155155+156156+ # extract content from leaflet's block structure
157157+ # pages[].blocks[].block.plaintext
158158+ content_parts = []
159159+ for page in value.get("pages", []):
160160+ for block_wrapper in page.get("blocks", []):
161161+ block = block_wrapper.get("block", {})
162162+ plaintext = block.get("plaintext", "")
163163+ if plaintext:
164164+ content_parts.append(plaintext)
165165+166166+ content = "\n\n".join(content_parts)
167167+168168+ return Document(
169169+ uri=record.uri,
170170+ title=value.get("title", ""),
171171+ content=content,
172172+ createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
173173+ tags=value.get("tags", []),
174174+ publicationUri=value.get("publication", ""),
175175+ )
176176+177177+178178+@mcp.tool
179179+async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
180180+ """find documents similar to a given document.
181181+182182+ uses vector similarity (voyage-3-lite embeddings) to find semantically
183183+ related documents. great for discovering related content after finding
184184+ an interesting document.
185185+186186+ args:
187187+ uri: the AT-URI of the document to find similar content for
188188+ limit: max similar documents to return (default 5)
189189+190190+ returns:
191191+ list of similar documents with uri, title, and metadata
192192+ """
193193+ async with get_http_client() as client:
194194+ response = await client.get("/similar", params={"uri": uri})
195195+ response.raise_for_status()
196196+ results = response.json()
197197+198198+ return [SearchResult(**r) for r in results[:limit]]
199199+200200+201201+@mcp.tool
202202+async def get_tags() -> list[Tag]:
203203+ """list all available tags with document counts.
204204+205205+ returns tags sorted by document count (most popular first).
206206+ useful for discovering topics and filtering searches.
207207+208208+ returns:
209209+ list of tags with their document counts
210210+ """
211211+ async with get_http_client() as client:
212212+ response = await client.get("/tags")
213213+ response.raise_for_status()
214214+ results = response.json()
215215+216216+ return [Tag(**t) for t in results]
217217+218218+219219+@mcp.tool
220220+async def get_stats() -> Stats:
221221+ """get index statistics.
222222+223223+ returns:
224224+ document and publication counts
225225+ """
226226+ async with get_http_client() as client:
227227+ response = await client.get("/stats")
228228+ response.raise_for_status()
229229+ return Stats(**response.json())
230230+231231+232232+@mcp.tool
233233+async def get_popular(limit: int = 5) -> list[PopularSearch]:
234234+ """get popular search queries.
235235+236236+ see what others are searching for.
237237+ can inspire new research directions.
238238+239239+ args:
240240+ limit: max queries to return (default 5)
241241+242242+ returns:
243243+ list of popular queries with search counts
244244+ """
245245+ async with get_http_client() as client:
246246+ response = await client.get("/popular")
247247+ response.raise_for_status()
248248+ results = response.json()
249249+250250+ return [PopularSearch(**p) for p in results[:limit]]
251251+252252+253253+# -----------------------------------------------------------------------------
254254+# resources
255255+# -----------------------------------------------------------------------------
256256+257257+258258+@mcp.resource("pub-search://stats")
259259+async def stats_resource() -> str:
260260+ """current index statistics."""
261261+ stats = await get_stats()
262262+ return f"pub search index: {stats.documents} documents, {stats.publications} publications"
263263+264264+265265+# -----------------------------------------------------------------------------
266266+# entrypoint
267267+# -----------------------------------------------------------------------------
268268+269269+270270+def main() -> None:
271271+ """run the MCP server."""
272272+ mcp.run()
273273+274274+275275+if __name__ == "__main__":
276276+ main()
+12-9
mcp/tests/test_mcp.py
···11-"""tests for leaflet MCP server."""
11+"""tests for pub-search MCP server."""
2233import pytest
44from mcp.types import TextContent
···66from fastmcp.client import Client
77from fastmcp.client.transports import FastMCPTransport
8899-from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from leaflet_mcp.server import mcp
99+from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from pub_search.server import mcp
111112121313class TestTypes:
···2323 snippet="this is a test...",
2424 createdAt="2025-01-01T00:00:00Z",
2525 rkey="123",
2626- basePath="/blog",
2626+ basePath="gyst.leaflet.pub",
2727+ platform="leaflet",
2728 )
2829 assert r.type == "article"
2930 assert r.uri == "at://did:plc:abc/pub.leaflet.document/123"
3031 assert r.title == "test article"
3232+ assert r.platform == "leaflet"
3333+ assert r.url == "https://gyst.leaflet.pub/123"
31343235 def test_search_result_looseleaf(self):
3336 """SearchResult supports looseleaf type."""
···93969497 def test_mcp_server_imports(self):
9598 """mcp server can be imported without errors."""
9696- from leaflet_mcp import mcp
9999+ from pub_search import mcp
971009898- assert mcp.name == "leaflet"
101101+ assert mcp.name == "pub-search"
99102100103 def test_exports(self):
101104 """all expected exports are available."""
102102- from leaflet_mcp import main, mcp
105105+ from pub_search import main, mcp
103106104107 assert mcp is not None
105108 assert main is not None
···138141 resources = await client.list_resources()
139142140143 resource_uris = {str(r.uri) for r in resources}
141141- assert "leaflet://stats" in resource_uris
144144+ assert "pub-search://stats" in resource_uris
142145143146 async def test_usage_guide_prompt_content(self, client):
144147 """usage_guide prompt returns helpful content."""
···148151 assert len(result.messages) > 0
149152 content = result.messages[0].content
150153 assert isinstance(content, TextContent)
151151- assert "Leaflet" in content.text
154154+ assert "pub-search" in content.text
152155 assert "search" in content.text
153156154157 async def test_search_tips_prompt_content(self, client):