···11-# leaflet-search notes
22-33-## deployment
44-- **backend**: push to `main` touching `backend/**` → auto-deploys via GitHub Actions
55-- **frontend**: manual deploy only (`wrangler pages deploy site --project-name leaflet-search`)
66-- **tap**: manual deploy from `tap/` directory (`fly deploy --app leaflet-search-tap`)
77-88-## remotes
99-- `origin`: tangled.sh:zzstoatzz.io/leaflet-search
1010-- `github`: github.com/zzstoatzz/leaflet-search (CI runs here)
1111-- push to both: `git push origin main && git push github main`
1212-1313-## architecture
1414-- **backend** (Zig): HTTP API, FTS5 search, vector similarity
1515-- **tap**: firehose sync via bluesky-social/indigo tap
1616-- **site**: static frontend on Cloudflare Pages
1717-- **db**: Turso (source of truth) + local SQLite read replica (FTS queries)
1818-1919-## platforms
2020-- leaflet, pckt, offprint: known platforms (detected via basePath)
2121-- other: site.standard.* documents not from a known platform
2222-2323-## search ranking
2424-- hybrid BM25 + recency: `ORDER BY rank + (days_old / 30)`
2525-- OR between terms for recall, prefix on last word
2626-- unicode61 tokenizer (non-alphanumeric = separator)
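
rough shape of the FTS table this implies (illustrative only - the real definition lives in `backend/src/db/schema.zig`, and the column set here is an assumption):

```sql
-- sketch: FTS5 table with the unicode61 tokenizer,
-- so any non-alphanumeric character acts as a token separator
CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts USING fts5(
  uri UNINDEXED,
  title,
  content,
  tokenize = 'unicode61'
);
```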
2727-2828-## tap operations
2929-- from `tap/` directory: `just check` (status), `just turbo` (catch-up), `just normal` (steady state)
3030-- see `docs/tap.md` for memory tuning and debugging
3131-3232-## common tasks
3333-- backfill embeddings: `./scripts/backfill-embeddings`
3434-- check indexing: `curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq`
+13-34
README.md
···11-# pub search
11+# leaflet-search
2233by [@zzstoatzz.io](https://bsky.app/profile/zzstoatzz.io)
4455-search ATProto publishing platforms ([leaflet](https://leaflet.pub), [pckt](https://pckt.blog), [offprint](https://offprint.app), [greengale](https://greengale.app), and others using [standard.site](https://standard.site)).
66-77-**live:** [pub-search.waow.tech](https://pub-search.waow.tech)
55+search for [leaflet](https://leaflet.pub).
8699-> formerly "leaflet-search" - generalized to support multiple publishing platforms
77+**live:** [leaflet-search.pages.dev](https://leaflet-search.pages.dev)
108119## how it works
12101313-1. **tap** syncs content from ATProto firehose (signals on `pub.leaflet.document`, filters `pub.leaflet.*` + `site.standard.*`)
1111+1. **tap** syncs leaflet content from the network
14122. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API
15133. **site** static frontend on Cloudflare Pages
1614···1917search is also exposed as an MCP server for AI agents like Claude Code:
20182119```bash
2222-claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
2020+claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
2321```
24222523see [mcp/README.md](mcp/README.md) for local setup and usage details.
···2725## api
28262927```
3030-GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date> # full-text search
3131-GET /similar?uri=<at-uri> # find similar documents
3232-GET /tags # list all tags with counts
3333-GET /popular # popular search queries
3434-GET /stats # counts + request latency (p50/p95)
3535-GET /health # health check
2828+GET /search?q=<query>&tag=<tag> # full-text search with query, tag, or both
2929+GET /similar?uri=<at-uri> # find similar documents via vector embeddings
3030+GET /tags # list all tags with counts
3131+GET /popular # popular search queries
3232+GET /stats # document/publication counts
3333+GET /health # health check
3634```
37353838-search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, or other). tag and platform filtering apply to documents only.
3939-4040-**ranking**: results use hybrid BM25 + recency scoring. text relevance is primary, but recent documents get a boost (~1 point per 30 days). the `since` parameter filters to documents created after the given ISO date (e.g., `since=2025-01-01`).
3636+search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). tag filtering applies to documents only.
41374238`/similar` uses [Voyage AI](https://voyageai.com) embeddings with brute-force cosine similarity (~0.15s for 3500 docs).
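
under the hood this is a brute-force scan over stored embeddings - roughly the SQL below, shown only as a sketch (the embedding column name and exact query are assumptions, not the backend's actual statement):

```sql
-- sketch: brute-force cosine similarity with libSQL's vector_distance_cos
-- (smaller distance = more similar)
SELECT d.uri, d.title
FROM documents d
WHERE d.embedding IS NOT NULL
ORDER BY vector_distance_cos(
    d.embedding,
    (SELECT embedding FROM documents WHERE uri = :source_uri)
) ASC
LIMIT 10;
```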
43394444-## configuration
4545-4646-the backend is fully configurable via environment variables:
4747-4848-| variable | default | description |
4949-|----------|---------|-------------|
5050-| `APP_NAME` | `leaflet-search` | name shown in startup logs |
5151-| `DASHBOARD_URL` | `https://pub-search.waow.tech/dashboard.html` | redirect target for `/dashboard` |
5252-| `TAP_HOST` | `leaflet-search-tap.fly.dev` | tap websocket host |
5353-| `TAP_PORT` | `443` | tap websocket port |
5454-| `PORT` | `3000` | HTTP server port |
5555-| `TURSO_URL` | - | Turso database URL (required) |
5656-| `TURSO_TOKEN` | - | Turso auth token (required) |
5757-| `VOYAGE_API_KEY` | - | Voyage AI API key (for embeddings) |
5858-5959-the backend indexes multiple ATProto platforms - currently `pub.leaflet.*` and `site.standard.*` collections. platform is stored per-document and returned in search results.
6060-6140## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)
62416342- [Fly.io](https://fly.io) hosts backend + tap
6443- [Turso](https://turso.tech) cloud SQLite with vector support
6544- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
6666-- [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
4545+- [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs leaflet content from ATProto firehose
6746- [Zig](https://ziglang.org) HTTP server, search API, content indexing
6847- [Cloudflare Pages](https://pages.cloudflare.com) static frontend
6948
···11const std = @import("std");
22-const posix = std.posix;
3243const schema = @import("schema.zig");
54const result = @import("result.zig");
66-const sync = @import("sync.zig");
7586// re-exports
97pub const Client = @import("Client.zig");
1010-pub const LocalDb = @import("LocalDb.zig");
118pub const Row = result.Row;
129pub const Result = result.Result;
1310pub const BatchResult = result.BatchResult;
···1512// global state
1613var gpa: std.heap.GeneralPurposeAllocator(.{}) = .{};
1714var client: ?Client = null;
1818-var local_db: ?LocalDb = null;
19152020-/// Initialize Turso client only (fast, call synchronously at startup)
2121-pub fn initTurso() !void {
1616+pub fn init() !void {
2217 client = try Client.init(gpa.allocator());
2318 try schema.init(&client.?);
2419}
25202626-/// Initialize local SQLite replica (slow, call in background thread)
2727-pub fn initLocalDb() void {
2828- initLocal() catch |err| {
2929- std.debug.print("local db init failed (will use turso only): {}\n", .{err});
3030- };
3131-}
3232-3333-pub fn init() !void {
3434- try initTurso();
3535- initLocalDb();
3636-}
3737-3838-fn initLocal() !void {
3939- // check if local db is disabled
4040- if (posix.getenv("LOCAL_DB_ENABLED")) |val| {
4141- if (std.mem.eql(u8, val, "false") or std.mem.eql(u8, val, "0")) {
4242- std.debug.print("local db disabled via LOCAL_DB_ENABLED\n", .{});
4343- return;
4444- }
4545- }
4646-4747- local_db = LocalDb.init(gpa.allocator());
4848- try local_db.?.open();
4949-}
5050-5121pub fn getClient() ?*Client {
5222 if (client) |*c| return c;
5323 return null;
5424}
5555-5656-/// Get local db if ready (synced and available)
5757-pub fn getLocalDb() ?*LocalDb {
5858- if (local_db) |*l| {
5959- if (l.isReady()) return l;
6060- }
6161- return null;
6262-}
6363-6464-/// Get local db even if not ready (for sync operations)
6565-pub fn getLocalDbRaw() ?*LocalDb {
6666- if (local_db) |*l| return l;
6767- return null;
6868-}
6969-7070-/// Start background sync thread (call from main after db.init)
7171-pub fn startSync() void {
7272- const c = getClient() orelse {
7373- std.debug.print("sync: no turso client, skipping\n", .{});
7474- return;
7575- };
7676- const local = getLocalDbRaw() orelse {
7777- std.debug.print("sync: no local db, skipping\n", .{});
7878- return;
7979- };
8080-8181- const thread = std.Thread.spawn(.{}, syncLoop, .{ c, local }) catch |err| {
8282- std.debug.print("sync: failed to start thread: {}\n", .{err});
8383- return;
8484- };
8585- thread.detach();
8686- std.debug.print("sync: background thread started\n", .{});
8787-}
8888-8989-fn syncLoop(turso: *Client, local: *LocalDb) void {
9090- // full sync on startup
9191- sync.fullSync(turso, local) catch |err| {
9292- std.debug.print("sync: initial full sync failed: {}\n", .{err});
9393- };
9494-9595- // get sync interval from env (default 5 minutes)
9696- const interval_secs: u64 = blk: {
9797- const env_val = posix.getenv("SYNC_INTERVAL_SECS") orelse "300";
9898- break :blk std.fmt.parseInt(u64, env_val, 10) catch 300;
9999- };
100100-101101- std.debug.print("sync: incremental sync every {d} seconds\n", .{interval_secs});
102102-103103- // periodic incremental sync
104104- while (true) {
105105- std.Thread.sleep(interval_secs * std.time.ns_per_s);
106106- sync.incrementalSync(turso, local) catch |err| {
107107- std.debug.print("sync: incremental sync failed: {}\n", .{err});
108108- };
109109- }
110110-}
+1-105
backend/src/db/schema.zig
···4444 \\CREATE VIRTUAL TABLE IF NOT EXISTS publications_fts USING fts5(
4545 \\ uri UNINDEXED,
4646 \\ name,
4747- \\ description,
4848- \\ base_path
4747+ \\ description
4948 \\)
5049 , &.{});
5150···128127 client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
129128 client.exec("UPDATE documents SET source_collection = 'pub.leaflet.document' WHERE source_collection IS NULL", &.{}) catch {};
130129131131- // multi-platform support for publications
132132- client.exec("ALTER TABLE publications ADD COLUMN platform TEXT DEFAULT 'leaflet'", &.{}) catch {};
133133- client.exec("ALTER TABLE publications ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.publication'", &.{}) catch {};
134134- client.exec("UPDATE publications SET platform = 'leaflet' WHERE platform IS NULL", &.{}) catch {};
135135- client.exec("UPDATE publications SET source_collection = 'pub.leaflet.publication' WHERE source_collection IS NULL", &.{}) catch {};
136136-137130 // vector embeddings column already added by backfill script
138138-139139- // dedupe index: same (did, rkey) across collections = same document
140140- // e.g., pub.leaflet.document/abc and site.standard.document/abc are the same content
141141- client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_documents_did_rkey ON documents(did, rkey)", &.{}) catch {};
142142- client.exec("CREATE UNIQUE INDEX IF NOT EXISTS idx_publications_did_rkey ON publications(did, rkey)", &.{}) catch {};
143143-144144- // backfill platform from source_collection for records indexed before platform detection fix
145145- client.exec("UPDATE documents SET platform = 'leaflet' WHERE platform = 'unknown' AND source_collection LIKE 'pub.leaflet.%'", &.{}) catch {};
146146- client.exec("UPDATE documents SET platform = 'pckt' WHERE platform = 'unknown' AND source_collection LIKE 'blog.pckt.%'", &.{}) catch {};
147147-148148- // rename 'standardsite' to 'other' (standardsite was a misnomer - it's a lexicon, not a platform)
149149- // documents using site.standard.* that don't match a known platform are simply "other"
150150- client.exec("UPDATE documents SET platform = 'other' WHERE platform = 'standardsite'", &.{}) catch {};
151151-152152- // detect platform from publication basePath (site.standard.* is a lexicon, not a platform)
153153- // known platforms (pckt, leaflet, offprint) use site.standard.* but have distinct basePaths
154154- client.exec(
155155- \\UPDATE documents SET platform = 'pckt'
156156- \\WHERE platform IN ('other', 'unknown')
157157- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%pckt.blog%')
158158- , &.{}) catch {};
159159-160160- client.exec(
161161- \\UPDATE documents SET platform = 'leaflet'
162162- \\WHERE platform IN ('other', 'unknown')
163163- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%leaflet.pub%')
164164- , &.{}) catch {};
165165-166166- client.exec(
167167- \\UPDATE documents SET platform = 'offprint'
168168- \\WHERE platform IN ('other', 'unknown')
169169- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%offprint.app%' OR base_path LIKE '%offprint.test%')
170170- , &.{}) catch {};
171171-172172- client.exec(
173173- \\UPDATE documents SET platform = 'greengale'
174174- \\WHERE platform IN ('other', 'unknown')
175175- \\AND publication_uri IN (SELECT uri FROM publications WHERE base_path LIKE '%greengale.app%')
176176- , &.{}) catch {};
177177-178178- // URL path field for documents (e.g., "/001" for zat.dev)
179179- // used to build full URL: publication.url + document.path
180180- client.exec("ALTER TABLE documents ADD COLUMN path TEXT", &.{}) catch {};
181181-182182- // denormalized columns for query performance (avoids per-row subqueries)
183183- client.exec("ALTER TABLE documents ADD COLUMN base_path TEXT DEFAULT ''", &.{}) catch {};
184184- client.exec("ALTER TABLE documents ADD COLUMN has_publication INTEGER DEFAULT 0", &.{}) catch {};
185185-186186- // backfill base_path from publications (idempotent - only updates empty values)
187187- client.exec(
188188- \\UPDATE documents SET base_path = COALESCE(
189189- \\ (SELECT p.base_path FROM publications p WHERE p.uri = documents.publication_uri),
190190- \\ (SELECT p.base_path FROM publications p WHERE p.did = documents.did LIMIT 1),
191191- \\ ''
192192- \\) WHERE base_path IS NULL OR base_path = ''
193193- , &.{}) catch {};
194194-195195- // backfill has_publication (idempotent)
196196- client.exec(
197197- "UPDATE documents SET has_publication = CASE WHEN publication_uri != '' THEN 1 ELSE 0 END WHERE has_publication = 0 AND publication_uri != ''",
198198- &.{},
199199- ) catch {};
200200-201201- // note: publications_fts was rebuilt with base_path column via scripts/rebuild-pub-fts
202202- // new publications will include base_path via insertPublication in indexer.zig
203203-204204- // 2026-01-22: clean up stale publication/self records that were deleted from ATProto
205205- // these cause incorrect basePath lookups for greengale documents
206206- // specifically: did:plc:27ivzcszryxp6mehutodmcxo had publication/self with basePath 'greengale.app'
207207- // but that publication was deleted, and the correct one is 'greengale.app/3fz.org'
208208- client.exec(
209209- \\DELETE FROM publications WHERE rkey = 'self'
210210- \\AND base_path = 'greengale.app'
211211- \\AND did = 'did:plc:27ivzcszryxp6mehutodmcxo'
212212- , &.{}) catch {};
213213- client.exec(
214214- \\DELETE FROM publications_fts WHERE uri IN (
215215- \\ SELECT 'at://' || did || '/site.standard.publication/self'
216216- \\ FROM publications WHERE rkey = 'self' AND base_path = 'greengale.app'
217217- \\)
218218- , &.{}) catch {};
219219-220220- // re-derive basePath for greengale documents that got wrong basePath
221221- // match documents to greengale publications (basePath contains greengale.app)
222222- // prefer more specific basePaths (with subdomain)
223223- client.exec(
224224- \\UPDATE documents SET base_path = (
225225- \\ SELECT p.base_path FROM publications p
226226- \\ WHERE p.did = documents.did
227227- \\ AND p.base_path LIKE 'greengale.app/%'
228228- \\ ORDER BY LENGTH(p.base_path) DESC
229229- \\ LIMIT 1
230230- \\)
231231- \\WHERE platform = 'greengale'
232232- \\AND (base_path = 'greengale.app' OR base_path LIKE '%pckt.blog%')
233233- \\AND did IN (SELECT did FROM publications WHERE base_path LIKE 'greengale.app/%')
234234- , &.{}) catch {};
235131}
···11-# API reference
22-33-base URL: `https://leaflet-search-backend.fly.dev`
44-55-## endpoints
66-77-### search
88-99-```
1010-GET /search?q=<query>&tag=<tag>&platform=<platform>&since=<date>
1111-```
1212-1313-full-text search across documents and publications.
1414-1515-**parameters:**
1616-| param | type | required | description |
1717-|-------|------|----------|-------------|
1818-| `q` | string | no* | search query (titles and content) |
1919-| `tag` | string | no | filter by tag (documents only) |
2020-| `platform` | string | no | filter by platform: `leaflet`, `pckt`, `offprint`, `greengale`, `other` |
2121-| `since` | string | no | ISO date, filter to documents created after |
2222-2323-*at least one of `q` or `tag` required
2424-2525-**response:**
2626-```json
2727-[
2828- {
2929- "type": "article|looseleaf|publication",
3030- "uri": "at://did:plc:.../collection/rkey",
3131- "did": "did:plc:...",
3232- "title": "document title",
3333- "snippet": "...matched text...",
3434- "createdAt": "2025-01-15T...",
3535- "rkey": "abc123",
3636- "basePath": "gyst.leaflet.pub",
3737- "platform": "leaflet",
3838- "path": "/001"
3939- }
4040-]
4141-```
4242-4343-**result types:**
4444-- `article`: document in a publication
4545-- `looseleaf`: standalone document (no publication)
4646-- `publication`: the publication itself (only returned for text queries, not tag/platform filters)
4747-4848-**ranking:** hybrid BM25 + recency. text relevance primary, recent docs boosted (~1 point per 30 days).
4949-5050-### similar
5151-5252-```
5353-GET /similar?uri=<at-uri>
5454-```
5555-5656-find semantically similar documents using vector similarity (voyage-3-lite embeddings).
5757-5858-**parameters:**
5959-| param | type | required | description |
6060-|-------|------|----------|-------------|
6161-| `uri` | string | yes | AT-URI of source document |
6262-6363-**response:** same format as search (array of results)
6464-6565-### tags
6666-6767-```
6868-GET /tags
6969-```
7070-7171-list all tags with document counts, sorted by popularity.
7272-7373-**response:**
7474-```json
7575-[
7676- {"tag": "programming", "count": 42},
7777- {"tag": "rust", "count": 15}
7878-]
7979-```
8080-8181-### popular
8282-8383-```
8484-GET /popular
8585-```
8686-8787-popular search queries.
8888-8989-**response:**
9090-```json
9191-[
9292- {"query": "rust async", "count": 12},
9393- {"query": "leaflet", "count": 8}
9494-]
9595-```
9696-9797-### platforms
9898-9999-```
100100-GET /platforms
101101-```
102102-103103-document counts by platform.
104104-105105-**response:**
106106-```json
107107-[
108108- {"platform": "leaflet", "count": 2500},
109109- {"platform": "pckt", "count": 800},
110110- {"platform": "greengale", "count": 150},
111111- {"platform": "offprint", "count": 50},
112112- {"platform": "other", "count": 100}
113113-]
114114-```
115115-116116-### stats
117117-118118-```
119119-GET /stats
120120-```
121121-122122-index statistics and request timing.
123123-124124-**response:**
125125-```json
126126-{
127127- "documents": 3500,
128128- "publications": 120,
129129- "embeddings": 3200,
130130- "searches": 5000,
131131- "errors": 5,
132132- "cache_hits": 1200,
133133- "cache_misses": 800,
134134- "timing": {
135135- "search": {"count": 1000, "avg_ms": 25, "p50_ms": 20, "p95_ms": 50, "p99_ms": 80, "max_ms": 150},
136136- "similar": {"count": 200, "avg_ms": 150, "p50_ms": 140, "p95_ms": 200, "p99_ms": 250, "max_ms": 300},
137137- "tags": {"count": 500, "avg_ms": 5, "p50_ms": 4, "p95_ms": 10, "p99_ms": 15, "max_ms": 25},
138138- "popular": {"count": 300, "avg_ms": 3, "p50_ms": 2, "p95_ms": 5, "p99_ms": 8, "max_ms": 12}
139139- }
140140-}
141141-```
142142-143143-### activity
144144-145145-```
146146-GET /activity
147147-```
148148-149149-hourly activity counts (last 24 hours).
150150-151151-**response:**
152152-```json
153153-[12, 8, 5, 3, 2, 1, 0, 0, 1, 5, 15, 25, 30, 28, 22, 18, 20, 25, 30, 35, 28, 20, 15, 10]
154154-```
155155-156156-### dashboard
157157-158158-```
159159-GET /api/dashboard
160160-```
161161-162162-rich dashboard data for analytics UI.
163163-164164-**response:**
165165-```json
166166-{
167167- "startedAt": 1705000000,
168168- "searches": 5000,
169169- "publications": 120,
170170- "documents": 3500,
171171- "platforms": [{"platform": "leaflet", "count": 2500}],
172172- "tags": [{"tag": "programming", "count": 42}],
173173- "timeline": [{"date": "2025-01-15", "count": 25}],
174174- "topPubs": [{"name": "gyst", "basePath": "gyst.leaflet.pub", "count": 150}],
175175- "timing": {...}
176176-}
177177-```
178178-179179-### health
180180-181181-```
182182-GET /health
183183-```
184184-185185-**response:**
186186-```json
187187-{"status": "ok"}
188188-```
189189-190190-## building URLs
191191-192192-documents can be accessed on the web via their `basePath` and `rkey`:
193193-- articles: `https://{basePath}/{rkey}` or `https://{basePath}{path}` if path is set
194194-- publications: `https://{basePath}`
195195-196196-examples:
197197-- `https://gyst.leaflet.pub/3ldasifz7bs2l`
198198-- `https://greengale.app/3fz.org/001`
-90
docs/content-extraction.md
···11-# content extraction for site.standard.document
22-33-lessons learned from implementing cross-platform content extraction.
44-55-## the problem
66-77-[eli mallon raised this question](https://bsky.app/profile/iame.li/post/3md4s4vm2os2y):
88-99-> The `site.standard.document` "content" field kinda confuses me. I see my leaflet posts have a $type field of "pub.leaflet.content". So if I were writing a renderer for site.standard.document records, presumably I'd have to know about separate things for leaflet, pckt, and offprint.
1010-1111-short answer: yes. but once you handle `content.pages` extraction, it's straightforward.
1212-1313-## textContent: platform-dependent
1414-1515-`site.standard.document` has a `textContent` field for pre-flattened plaintext:
1616-1717-```json
1818-{
1919- "title": "my post",
2020- "textContent": "the full text content, ready for indexing...",
2121- "content": {
2222- "$type": "blog.pckt.content",
2323- "items": [ /* platform-specific blocks */ ]
2424- }
2525-}
2626-```
2727-2828-**pckt, offprint, greengale** populate `textContent`. extraction is trivial.
2929-3030-**leaflet** intentionally leaves `textContent` null to avoid inflating record size. content lives in `content.pages[].blocks[].block.plaintext`.
3131-3232-## extraction strategy
3333-3434-priority order (in `extractor.zig`):
3535-3636-1. `textContent` - use if present
3737-2. `pages` - top-level blocks (pub.leaflet.document)
3838-3. `content.pages` - nested blocks (site.standard.document with pub.leaflet.content)
3939-4040-```zig
4141-// try textContent first
4242-if (zat.json.getString(record, "textContent")) |text| {
4343- return text;
4444-}
4545-4646-// fall back to block parsing
4747-const pages = zat.json.getArray(record, "pages") orelse
4848- zat.json.getArray(record, "content.pages");
4949-```
5050-5151-the key insight: if you extract from `content.pages` correctly, you're good. no need for extra network calls.
5252-5353-## deduplication
5454-5555-documents can appear in both collections with identical `(did, rkey)`:
5656-- `site.standard.document`
5757-- `pub.leaflet.document`
5858-5959-handle with `ON CONFLICT`:
6060-6161-```sql
6262-INSERT INTO documents (uri, ...)
6363-ON CONFLICT(uri) DO UPDATE SET ...
6464-```
6565-6666-note: leaflet is phasing out `pub.leaflet.document` records, keeping old ones for backwards compat.
6767-6868-## platform detection
6969-7070-collection name doesn't indicate platform for `site.standard.*` records. infer from publication `basePath`:
7171-7272-| basePath contains | platform |
7373-|-------------------|----------|
7474-| `leaflet.pub` | leaflet |
7575-| `pckt.blog` | pckt |
7676-| `offprint.app` | offprint |
7777-| `greengale.app` | greengale |
7878-| (none) | other |
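
the same mapping expressed as SQL over the publication's `base_path` (a sketch - the actual detection happens in `indexer.zig`):

```sql
-- illustrative only: derive a document's platform from its publication's base_path
SELECT
  d.uri,
  CASE
    WHEN p.base_path LIKE '%leaflet.pub%'   THEN 'leaflet'
    WHEN p.base_path LIKE '%pckt.blog%'     THEN 'pckt'
    WHEN p.base_path LIKE '%offprint.app%'  THEN 'offprint'
    WHEN p.base_path LIKE '%greengale.app%' THEN 'greengale'
    ELSE 'other'
  END AS platform
FROM documents d
LEFT JOIN publications p ON p.uri = d.publication_uri;
```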
7979-8080-## summary
8181-8282-- **pckt/offprint/greengale**: use `textContent` directly
8383-- **leaflet**: extract from `content.pages[].blocks[].block.plaintext`
8484-- **deduplication**: `ON CONFLICT` on `(did, rkey)` or `uri`
8585-- **platform**: infer from publication basePath, not collection name
8686-8787-## code references
8888-8989-- `backend/src/extractor.zig` - content extraction logic
9090-- `backend/src/indexer.zig:99-112` - platform detection from basePath
-226
docs/scratch/leaflet-publishing-plan.md
···11-# publishing to leaflet.pub
22-33-## goal
44-55-publish markdown docs to both:
66-1. `site.standard.document` (for search/interop) - already working
77-2. `pub.leaflet.document` (for leaflet.pub display) - this plan
88-99-## the mapping
1010-1111-### block types
1212-1313-| markdown | leaflet block |
1414-|----------|---------------|
1515-| `# heading` | `pub.leaflet.blocks.header` (level 1-6) |
1616-| paragraph | `pub.leaflet.blocks.text` |
1717-| ``` code ``` | `pub.leaflet.blocks.code` |
1818-| `> quote` | `pub.leaflet.blocks.blockquote` |
1919-| `---` | `pub.leaflet.blocks.horizontalRule` |
2020-| `- item` | `pub.leaflet.blocks.unorderedList` |
2121-| `![alt](url)` | `pub.leaflet.blocks.image` (requires blob upload) |
2222-| `[text](url)` (standalone) | `pub.leaflet.blocks.website` |
2323-2424-### inline formatting (facets)
2525-2626-leaflet uses byte-indexed facets for inline formatting within text blocks:
2727-2828-```json
2929-{
3030- "$type": "pub.leaflet.blocks.text",
3131- "plaintext": "hello world with bold text",
3232- "facets": [{
3333- "index": { "byteStart": 17, "byteEnd": 21 },
3434- "features": [{ "$type": "pub.leaflet.richtext.facet#bold" }]
3535- }]
3636-}
3737-```
3838-3939-| markdown | facet type |
4040-|----------|------------|
4141-| `**bold**` | `pub.leaflet.richtext.facet#bold` |
4242-| `*italic*` | `pub.leaflet.richtext.facet#italic` |
4343-| `` `code` `` | `pub.leaflet.richtext.facet#code` |
4444-| `[text](url)` | `pub.leaflet.richtext.facet#link` |
4545-| `~~strike~~` | `pub.leaflet.richtext.facet#strikethrough` |
4646-4747-## record structure
4848-4949-```json
5050-{
5151- "$type": "pub.leaflet.document",
5252- "author": "did:plc:...",
5353- "title": "document title",
5454- "description": "optional description",
5555- "publishedAt": "2026-01-06T00:00:00Z",
5656- "publication": "at://did:plc:.../pub.leaflet.publication/rkey",
5757- "tags": ["tag1", "tag2"],
5858- "pages": [{
5959- "$type": "pub.leaflet.pages.linearDocument",
6060- "id": "page-uuid",
6161- "blocks": [
6262- {
6363- "$type": "pub.leaflet.pages.linearDocument#block",
6464- "block": { /* one of the block types above */ }
6565- }
6666- ]
6767- }]
6868-}
6969-```
7070-7171-## implementation plan
7272-7373-### phase 1: markdown parser
7474-7575-add a simple markdown block parser to zat or the publish script:
7676-7777-```zig
7878-const BlockType = enum {
7979- heading,
8080- paragraph,
8181- code,
8282- blockquote,
8383- horizontal_rule,
8484- unordered_list,
8585- image,
8686-};
8787-8888-const Block = struct {
8989- type: BlockType,
9090- content: []const u8,
9191- level: ?u8 = null, // for headings
9292- language: ?[]const u8 = null, // for code blocks
9393- alt: ?[]const u8 = null, // for images
9494- src: ?[]const u8 = null, // for images
9595-};
9696-9797-fn parseMarkdownBlocks(allocator: Allocator, markdown: []const u8) ![]Block
9898-```
9999-100100-parsing approach:
101101-- split on blank lines to get blocks
102102-- identify block type by first characters:
103103- - `#` → heading (count `#` for level)
104104- - ``` → code block (capture until closing ```)
105105- - `>` → blockquote
106106- - `---` → horizontal rule
107107- - `-` or `*` at start → list item
108108- - `![` → image
109109- - else → paragraph
110110-111111-### phase 2: inline facet extraction
112112-113113-for text blocks, extract inline formatting:
114114-115115-```zig
116116-const Facet = struct {
117117- byte_start: usize,
118118- byte_end: usize,
119119- feature: FacetFeature,
120120-};
121121-122122-const FacetFeature = union(enum) {
123123- bold,
124124- italic,
125125- code,
126126- link: []const u8, // url
127127- strikethrough,
128128-};
129129-130130-fn extractFacets(allocator: Allocator, text: []const u8) !struct {
131131- plaintext: []const u8,
132132- facets: []Facet,
133133-}
134134-```
135135-136136-approach:
137137-- scan for `**`, `*`, `` ` ``, `[`, `~~`
138138-- track byte positions as we strip markers
139139-- build facet list with adjusted indices
140140-141141-### phase 3: image blob upload
142142-143143-images need to be uploaded as blobs before referencing:
144144-145145-```zig
146146-fn uploadImageBlob(client: *XrpcClient, allocator: Allocator, image_path: []const u8) !BlobRef
147147-```
148148-149149-for now, could skip images or require them to already be uploaded.
150150-151151-### phase 4: json serialization
152152-153153-build the full `pub.leaflet.document` record:
154154-155155-```zig
156156-const LeafletDocument = struct {
157157- @"$type": []const u8 = "pub.leaflet.document",
158158- author: []const u8,
159159- title: []const u8,
160160- description: ?[]const u8 = null,
161161- publishedAt: []const u8,
162162- publication: ?[]const u8 = null,
163163- tags: ?[][]const u8 = null,
164164- pages: []Page,
165165-};
166166-167167-const Page = struct {
168168- @"$type": []const u8 = "pub.leaflet.pages.linearDocument",
169169- id: []const u8,
170170- blocks: []BlockWrapper,
171171-};
172172-```
173173-174174-### phase 5: integrate into publish-docs.zig
175175-176176-update the publish script to:
177177-1. parse markdown into blocks
178178-2. convert to leaflet structure
179179-3. publish `pub.leaflet.document` alongside `site.standard.document`
180180-181181-```zig
182182-// existing: publish site.standard.document
183183-try putRecord(&client, allocator, session.did, "site.standard.document", tid.str(), doc_record);
184184-185185-// new: also publish pub.leaflet.document
186186-const leaflet_record = try markdownToLeaflet(allocator, content, title, session.did, pub_uri);
187187-try putRecord(&client, allocator, session.did, "pub.leaflet.document", tid.str(), leaflet_record);
188188-```
189189-190190-## complexity estimate
191191-192192-| component | complexity | notes |
193193-|-----------|------------|-------|
194194-| block parsing | medium | regex-free, line-by-line |
195195-| facet extraction | medium | byte index tracking is fiddly |
196196-| image upload | low | already have blob upload in xrpc |
197197-| json serialization | low | std.json handles it |
198198-| integration | low | add to existing publish flow |
199199-200200-total: ~300-500 lines of zig
201201-202202-## open questions
203203-204204-1. **publication record**: do we need a `pub.leaflet.publication` too, or just documents?
205205- - leaflet allows standalone documents without publications
206206- - could skip publication for now
207207-208208-2. **image handling**:
209209- - option A: skip images initially (just text content)
210210- - option B: require images to be URLs (no blob upload)
211211- - option C: full blob upload support
212212-213213-3. **deduplication**: same rkey for both record types?
214214- - pro: easy to correlate
215215- - con: different collections, might not matter
216216-217217-4. **validation**: leaflet has a validate endpoint
218218- - could call `/api/unstable_validate` to check records before publish
219219- - probably skip for v1
220220-221221-## references
222222-223223-- [pub.leaflet.document schema](/tmp/leaflet/lexicons/pub/leaflet/document.json)
224224-- [leaflet publishToPublication.ts](/tmp/leaflet/actions/publishToPublication.ts) - how leaflet creates records
225225-- [site.standard.document schema](/tmp/standard.site/app/data/lexicons/document.json)
226226-- paul's site: fetches records, doesn't publish them
-272
docs/scratch/logfire-zig-adoption.md
···11-# logfire-zig adoption guide for leaflet-search
22-33-guide for integrating logfire-zig into the leaflet-search backend.
44-55-## 1. add dependency
66-77-in `backend/build.zig.zon`:
88-99-```zig
1010-.dependencies = .{
1111- // ... existing deps ...
1212- .logfire = .{
1313- .url = "https://tangled.sh/zzstoatzz.io/logfire-zig/archive/main",
1414- .hash = "...", // run zig build to get hash
1515- },
1616-},
1717-```
1818-1919-in `backend/build.zig`, add the import:
2020-2121-```zig
2222-const logfire = b.dependency("logfire", .{
2323- .target = target,
2424- .optimize = optimize,
2525-});
2626-exe.root_module.addImport("logfire", logfire.module("logfire"));
2727-```
2828-2929-## 2. configure in main.zig
3030-3131-```zig
3232-const std = @import("std");
3333-const logfire = @import("logfire");
3434-// ... other imports ...
3535-3636-pub fn main() !void {
3737- var gpa = std.heap.GeneralPurposeAllocator(.{}){};
3838- defer _ = gpa.deinit();
3939- const allocator = gpa.allocator();
4040-4141- // configure logfire early
4242- // reads LOGFIRE_WRITE_TOKEN from env automatically
4343- const lf = try logfire.configure(.{
4444- .service_name = "leaflet-search",
4545- .service_version = "0.0.1",
4646- .environment = std.posix.getenv("FLY_APP_NAME") orelse "development",
4747- });
4848- defer lf.shutdown();
4949-5050- logfire.info("starting leaflet-search on port {d}", .{port});
5151-5252- // ... rest of main ...
5353-}
5454-```
5555-5656-## 3. replace timing.zig with spans
5757-5858-current pattern in server.zig:
5959-6060-```zig
6161-fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
6262- const start_time = std.time.microTimestamp();
6363- defer timing.record(.search, start_time);
6464- // ...
6565-}
6666-```
6767-6868-with logfire:
6969-7070-```zig
7171-fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
7272- const span = logfire.span("search.handle", .{});
7373- defer span.end();
7474-7575- // parse params
7676- const query = parseQueryParam(alloc, target, "q") catch "";
7777-7878- // add attributes after parsing
7979- span.setAttribute("query", query);
8080- span.setAttribute("tag", tag_filter orelse "");
8181-8282- // ...
8383-}
8484-```
8585-8686-for nested operations:
8787-8888-```zig
8989-fn search(alloc: Allocator, query: []const u8, ...) ![]Result {
9090- const span = logfire.span("search.execute", .{
9191- .query_length = @intCast(query.len),
9292- });
9393- defer span.end();
9494-9595- // FTS query
9696- {
9797- const fts_span = logfire.span("search.fts", .{});
9898- defer fts_span.end();
9999- // ... FTS logic ...
100100- }
101101-102102- // vector search fallback
103103- if (results.len < limit) {
104104- const vec_span = logfire.span("search.vector", .{});
105105- defer vec_span.end();
106106- // ... vector search ...
107107- }
108108-109109- return results;
110110-}
111111-```
112112-113113-## 4. add structured logging
114114-115115-replace `std.debug.print` with logfire:
116116-117117-```zig
118118-// before
119119-std.debug.print("accept error: {}\n", .{err});
120120-121121-// after
122122-logfire.err("accept error: {}", .{err});
123123-```
124124-125125-```zig
126126-// before
127127-std.debug.print("{s} listening on http://0.0.0.0:{d}\n", .{app_name, port});
128128-129129-// after
130130-logfire.info("{s} listening on port {d}", .{app_name, port});
131131-```
132132-133133-for sync operations in tap.zig:
134134-135135-```zig
136136-logfire.info("sync complete", .{});
137137-logfire.debug("processed {d} events", .{event_count});
138138-```
139139-140140-for errors:
141141-142142-```zig
143143-logfire.err("turso query failed: {s}", .{@errorName(err)});
144144-```
145145-146146-## 5. add metrics
147147-148148-replace stats.zig counters with logfire metrics:
149149-150150-```zig
151151-// before (in stats.zig)
152152-pub fn recordSearch(query: []const u8) void {
153153- total_searches.fetchAdd(1, .monotonic);
154154- // ...
155155-}
156156-157157-// with logfire (in server.zig or stats.zig)
158158-pub fn recordSearch(query: []const u8) void {
159159- logfire.counter("search.total", 1);
160160- // existing logic...
161161-}
162162-```
163163-164164-for gauges (e.g., active connections, document counts):
165165-166166-```zig
167167-logfire.gaugeInt("documents.indexed", doc_count);
168168-logfire.gaugeInt("connections.active", active_count);
169169-```
170170-171171-for latency histograms (more detail than counter):
172172-173173-```zig
174174-// after search completes
175175-logfire.metric(.{
176176- .name = "search.latency_ms",
177177- .unit = "ms",
178178- .data = .{
179179- .histogram = .{
180180- .data_points = &[_]logfire.HistogramDataPoint{.{
181181- .start_time_ns = start_ns,
182182- .time_ns = std.time.nanoTimestamp(),
183183- .count = 1,
184184- .sum = latency_ms,
185185- .bucket_counts = ...,
186186- .explicit_bounds = ...,
187187- .min = latency_ms,
188188- .max = latency_ms,
189189- }},
190190- },
191191- },
192192-});
193193-```
194194-195195-## 6. deployment
196196-197197-add to fly.toml secrets:
198198-199199-```bash
200200-fly secrets set LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxxxx --app leaflet-search-backend
201201-```
202202-203203-logfire-zig reads from `LOGFIRE_WRITE_TOKEN` or `LOGFIRE_TOKEN` automatically.
204204-205205-## 7. what to keep from existing code
206206-207207-**keep timing.zig** - it provides local latency histograms for the dashboard API. logfire spans complement this with distributed tracing.
208208-209209-**keep stats.zig** - local counters are still useful for the `/stats` endpoint. logfire metrics add remote observability.
210210-211211-**keep activity.zig** - tracks recent activity for the dashboard. orthogonal to logfire.
212212-213213-the pattern is: local state for dashboard UI, logfire for observability.
214214-215215-## 8. migration order
216216-217217-1. add dependency, configure in main.zig
218218-2. add spans to request handlers (search, similar, tags, popular)
219219-3. add structured logging for errors and important events
220220-4. add metrics for key counters
221221-5. gradually replace `std.debug.print` with logfire logging
222222-6. consider removing timing.zig if logfire histograms are sufficient
223223-224224-## 9. example: full search handler
225225-226226-```zig
227227-fn handleSearch(request: *http.Server.Request, target: []const u8) !void {
228228- const span = logfire.span("http.search", .{});
229229- defer span.end();
230230-231231- var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
232232- defer arena.deinit();
233233- const alloc = arena.allocator();
234234-235235- const query = parseQueryParam(alloc, target, "q") catch "";
236236- const tag_filter = parseQueryParam(alloc, target, "tag") catch null;
237237-238238- if (query.len == 0 and tag_filter == null) {
239239- logfire.debug("empty search request", .{});
240240- try sendJson(request, "{\"error\":\"enter a search term\"}");
241241- return;
242242- }
243243-244244- const results = search.search(alloc, query, tag_filter, null, null) catch |err| {
245245- logfire.err("search failed: {}", .{@errorName(err)});
246246- stats.recordError();
247247- return err;
248248- };
249249-250250- logfire.counter("search.requests", 1);
251251- logfire.info("search completed", .{});
252252-253253- // ... send response ...
254254-}
255255-```
256256-257257-## 10. verifying it works
258258-259259-run locally:
260260-261261-```bash
262262-LOGFIRE_WRITE_TOKEN=pylf_v1_us_xxx zig build run
263263-```
264264-265265-check logfire dashboard for traces from `leaflet-search` service.
266266-267267-without token (console fallback):
268268-269269-```bash
270270-zig build run
271271-# prints [span], [info], [metric] to stderr
272272-```
-350
docs/scratch/standard-search-planning.md
···11-# standard-search planning
22-33-expanding leaflet-search to index all standard.site records.
44-55-## references
66-77-- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
88-- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
99-- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
1010-- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons
1111-- [ATProto docs](https://atproto.com/docs) - protocol documentation
1212-1313-## context
1414-1515-discussion with pckt.blog team about building global search for standard.site ecosystem.
1616-current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.
1717-1818-### recent work (2026-01-05)
1919-2020-added similarity cache to improve `/similar` endpoint performance:
2121-- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
2222-- cache auto-invalidates when document count changes
2323-- `/stats` endpoint now shows `cache_hits` and `cache_misses`
2424-- first request ~3s (cold), cached requests ~0.15s
2525-2626-also added loading indicator for "related to" results in frontend.
2727-2828-### recent work (2026-01-06)
2929-3030-- merged PR1: multi-platform schema (platform + source_collection columns)
3131-- added `loading.js` - portable loading state handler for dashboards
3232- - skeleton shimmer while loading
3333- - "waking up" toast after 2s threshold (fly.io cold start handling)
3434- - designed to be copied to other projects
3535-- fixed pluralization ("1 result" vs "2 results")
3636-3737-## what we know
3838-3939-### standard.site lexicons
4040-4141-two shared lexicons for long-form publishing on ATProto:
4242-- `site.standard.document` - document content and metadata
4343-- `site.standard.publication` - publication/blog metadata
4444-4545-implementing platforms:
4646-- leaflet.pub (`pub.leaflet.*`)
4747-- pckt.blog (`blog.pckt.*`)
4848-- offprint.app (`app.offprint.*`)
4949-5050-### site.standard.document schema
5151-5252-examined real records from pckt.blog. key fields:
5353-5454-```
5555-textContent - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
5656-content - platform-specific block structure
5757- .$type - identifies platform (e.g., "blog.pckt.content")
5858-title - document title
5959-tags - array of strings
6060-site - AT-URI reference to site.standard.publication
6161-path - URL path (e.g., "/my-post-abc123")
6262-publishedAt - ISO timestamp
6363-updatedAt - ISO timestamp
6464-coverImage - blob reference
6565-```
6666-6767-### the textContent field
6868-6969-this is huge. platforms flatten their block content into a single text field:
7070-7171-```json
7272-{
7373- "content": {
7474- "$type": "blog.pckt.content",
7575- "items": [ /* platform-specific blocks */ ]
7676- },
7777- "textContent": "i have been writing a lot of atproto things in zig!..."
7878-}
7979-```
8080-8181-no need to parse platform-specific blocks - just index `textContent` directly.
8282-8383-### platform detection
8484-8585-derive platform from `content.$type` prefix:
8686-- `blog.pckt.content` → pckt
8787-- `pub.leaflet.content` → leaflet (TBD - need to verify)
8888-- `app.offprint.content` → offprint (TBD - need to verify)
8989-9090-### current leaflet-search architecture
9191-9292-```
9393-ATProto firehose (via tap)
9494- ↓
9595-tap.zig - subscribes to pub.leaflet.document/publication
9696- ↓
9797-indexer.zig - extracts content from nested pages[].blocks[] structure
9898- ↓
9999-turso (sqlite) - documents table + FTS5 + embeddings
100100- ↓
101101-search.zig - FTS5 queries + vector similarity
102102- ↓
103103-server.zig - HTTP API (/search, /similar, /stats)
104104-```
105105-106106-leaflet-specific code:
107107-- tap.zig lines 10-11: hardcoded collection names
108108-- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
109109-- recursive page/block traversal logic
110110-111111-generalizable code:
112112-- database schema (FTS5, tags, stats, similarity cache)
113113-- search/similar logic
114114-- HTTP API
115115-- embedding pipeline
116116-117117-## proposed architecture for standard-search
118118-119119-### ingestion changes
120120-121121-subscribe to:
122122-- `site.standard.document`
123123-- `site.standard.publication`
124124-125125-optionally also subscribe to platform-specific collections for richer data:
126126-- `pub.leaflet.document/publication`
127127-- `blog.pckt.document/publication` (if they have these)
128128-- `app.offprint.document/publication` (if they have these)
129129-130130-### content extraction
131131-132132-for `site.standard.document`:
133133-1. use `textContent` field directly - no block parsing!
134134-2. fall back to title + description if textContent missing
135135-136136-for platform-specific records (if needed):
137137-- keep existing leaflet block parser
138138-- add parsers for other platforms as needed
139139-140140-### database changes
141141-142142-add to documents table:
143143-- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
144144-- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
145145-- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
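
a sketch of the schema prep this implies (illustrative - the actual migration lives in `schema.zig`; defaults match the PR1 notes below):

```sql
-- illustrative multi-platform migration for the columns listed above
ALTER TABLE documents ADD COLUMN platform TEXT DEFAULT 'leaflet';
ALTER TABLE documents ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.document';
ALTER TABLE documents ADD COLUMN standard_uri TEXT;
```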
146146-147147-### API changes
148148-149149-- `/search?q=...&platform=leaflet` - optional platform filter
150150-- results include `platform` field
151151-- `/similar` works across all platforms
152152-153153-### naming/deployment
154154-155155-options:
156156-1. rename leaflet-search → standard-search (breaking change)
157157-2. new repo/deployment, keep leaflet-search as-is
158158-3. branch and generalize, decide naming later
159159-160160-leaning toward option 3 for now.
161161-162162-## findings from exploration
163163-164164-### pckt.blog - READY
165165-- writes `site.standard.document` records
166166-- has `textContent` field (pre-flattened)
167167-- `content.$type` = `blog.pckt.content`
168168-- 6+ records found on pckt.blog service account
169169-170170-### leaflet.pub - NOT YET MIGRATED
171171-- still using `pub.leaflet.document` only
172172-- no `site.standard.document` records found
173173-- no `textContent` field - content is in nested `pages[].blocks[]`
174174-- will need to continue parsing blocks OR wait for migration
175175-176176-### offprint.app - NOW INDEXED (2026-01-22)
177177-- writes `site.standard.document` records with `app.offprint.content` blocks
178178-- has `textContent` field (pre-flattened)
179179-- platform detected via basePath (`*.offprint.app`, `*.offprint.test`)
180180-- now fully supported alongside leaflet and pckt
181181-182182-### greengale.app - NOW INDEXED (2026-01-22)
183183-- writes `site.standard.document` records
184184-- has `textContent` field (pre-flattened)
185185-- platform detected via basePath (`greengale.app/*`)
186186-- ~29 documents indexed at time of discovery
187187-- now fully supported alongside leaflet, pckt, and offprint
188188-189189-### implication for architecture
190190-191191-two paths:
192192-193193-**path A: wait for leaflet migration**
194194-- simpler: just index `site.standard.document` with `textContent`
195195-- all platforms converge on same schema
196196-- downside: loses existing leaflet search until they migrate
197197-198198-**path B: hybrid approach**
199199-- index `site.standard.document` (pckt, future leaflet, offprint)
200200-- ALSO index `pub.leaflet.document` with existing block parser
201201-- dedupe by URI or store both with `source_collection` indicator
202202-- more complex but maintains backwards compat
203203-204204-leaning toward **path B** - can't lose 3500 leaflet docs.
205205-206206-## open questions
207207-208208-- [x] does leaflet write site.standard.document records? **NO, not yet**
209209-- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
210210-- [ ] when will leaflet migrate to standard.site?
211211-- [ ] should we dedupe platform-specific vs standard records?
212212-- [ ] embeddings: regenerate for all, or use same model?
213213-214214-## implementation plan (PRs)
215215-216216-breaking work into reviewable chunks:
217217-218218-### PR1: database schema for multi-platform โ MERGED
219219-- add `platform TEXT` column to documents (default 'leaflet')
220220-- add `source_collection TEXT` column (default 'pub.leaflet.document')
221221-- backfill existing ~3500 records
222222-- no behavior change, just schema prep
223223-- https://github.com/zzstoatzz/leaflet-search/pull/1
224224-225225-### PR2: generalized content extraction
226226-- new `extractor.zig` module with platform-agnostic interface
227227-- `textContent` extraction for standard.site records
228228-- keep existing block parser for `pub.leaflet.*`
229229-- platform detection from `content.$type`
230230-231231-### PR3: tap subscriber for site.standard.document
232232-- subscribe to `site.standard.document` + `site.standard.publication`
233233-- route to appropriate extractor
234234-- starts ingesting pckt.blog content
235235-236236-### PR4: API platform filter
237237-- add `?platform=` query param to `/search`
238238-- include `platform` field in results
239239-- frontend: show platform badge, optional filter
240240-241241-### PR5 (optional, separate track): witness cache
242242-- `witness_cache` table for raw records
243243-- replay tooling for backfills
244244-- independent of above work
245245-246246-## operational notes
247247-248248-- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
249249- ```bash
250250- wrangler pages deploy site --project-name leaflet-search
251251- ```
252252-- **fly.io backend**: deploy from backend directory:
253253- ```bash
254254- cd backend && fly deploy
255255- ```
256256-- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
257257-258258-## next steps
259259-260260-1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
261261-2. ~~find and examine offprint records~~ (done - no public content yet)
262262-3. ~~PR1: database schema~~ (merged)
263263-4. PR2: generalized content extraction
264264-5. PR3: tap subscriber
265265-6. PR4: API platform filter
266266-7. consider witness cache architecture (see below)
267267-268268----
269269-270270-## architectural consideration: witness cache
271271-272272-[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
273273-274274-> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
275275->
276276-> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
277277->
278278-> The key feature is: you can replay it
279279-280280-> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
281281-282282-### current leaflet-search architecture (no witness cache)
283283-284284-```
285285-Firehose → tap → Parse & Transform → Store DERIVED data → Discard raw record
286286-```
287287-288288-we store:
289289-- `uri`, `did`, `rkey`
290290-- `title` (extracted)
291291-- `content` (flattened from blocks)
292292-- `created_at`, `publication_uri`
293293-294294-we discard: the raw record JSON
295295-296296-### witness cache architecture
297297-298298-```
299299-Firehose → Store RAW record + witness_time → Derive indexes on demand (replayable)
300300-```
301301-302302-would store:
303303-- `uri`, `collection`, `rkey`
304304-- `raw_record` (full JSON blob)
305305-- `witness_time` (when we indexed it)
306306-307307-then derive FTS, embeddings, etc. from local data via replay.
308308-309309-### comparison
310310-311311-| scenario | current (no cache) | with witness cache |
312312-|----------|-------------------|-------------------|
313313-| add new parser (offprint) | re-crawl network | replay local |
314314-| leaflet adds textContent | wait for new records | replay & re-extract |
315315-| fix parsing bug | re-crawl affected | replay & re-derive |
316316-| change embedding model | re-fetch content | replay local |
317317-| add new index/table | backfill from network | replay locally |
318318-319319-### trade-offs
320320-321321-**storage cost:**
322322-- ~3500 docs × ~10KB avg = ~35MB (not huge)
323323-- turso free tier: 9GB, so plenty of room
324324-325325-**complexity:**
326326-- two-phase: store raw, then derive
327327-- vs current one-phase: derive immediately
328328-329329-**benefits for standard-search:**
330330-- could add offprint/pckt parsers and replay existing data
331331-- when leaflet migrates to standard.site, re-derive without network
332332-- embedding backfill becomes local-only (no voyage API for content fetch)
333333-334334-### implementation options
335335-336336-1. **add `raw_record TEXT` column to existing tables**
337337- - simple, backwards compatible
338338- - can migrate incrementally
339339-340340-2. **separate `witness_cache` table**
341341- - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
342342- - cleaner separation of concerns
343343- - documents/publications tables become derived views
344344-345345-3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
346346- - better compression for JSON blobs
347347- - good for analytics queries
348348- - adds operational complexity
349349-350350-for our scale, option 1 or 2 with turso is probably fine.
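
a sketch of what option 2 above could look like (types are assumptions, not a committed design):

```sql
-- illustrative witness_cache per option 2: raw record + witness time, replayable
CREATE TABLE IF NOT EXISTS witness_cache (
  uri TEXT PRIMARY KEY,
  collection TEXT NOT NULL,
  raw_record TEXT NOT NULL,       -- full record JSON as received
  witness_time INTEGER NOT NULL   -- when we indexed it
);
```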
-124
docs/search-architecture.md
···11-# search architecture
22-33-current state, rationale, and future options.
44-55-## current: SQLite FTS5
66-77-we use SQLite's built-in full-text search (FTS5) via Turso.
88-99-### why FTS5 works for now
1010-1111-- **scale**: ~3500 documents. FTS5 handles this trivially.
1212-- **latency**: 10-50ms for search queries. fine for our use case.
1313-- **cost**: $0. included with Turso free tier.
1414-- **ops**: zero. no separate service to run.
1515-- **simplicity**: one database for everything (docs, FTS, vectors, cache).
1616-1717-### how it works
1818-1919-```
2020-user query: "crypto-casino"
2121- ↓
2222-buildFtsQuery(): "crypto OR casino*"
2323- ↓
2424-FTS5 MATCH query with BM25 + recency decay
2525- ↓
2626-results with snippet()
2727-```
2828-2929-key decisions:
3030-- **OR between terms** for better recall (deliberate, see commit 35ad4b5)
3131-- **prefix match on last word** for type-ahead feel
3232-- **unicode61 tokenizer** splits on non-alphanumeric (we match this in buildFtsQuery)
3333-- **recency decay** boosts recent docs: `ORDER BY rank + (days_old / 30)`
3434-3535-### what's coupled to FTS5
3636-3737-all in `backend/src/search.zig`:
3838-3939-| component | FTS5-specific |
4040-|-----------|---------------|
4141-| 10 query definitions | `MATCH`, `snippet()`, `ORDER BY rank` |
4242-| `buildFtsQuery()` | constructs FTS5 syntax |
4343-| schema | `documents_fts`, `publications_fts` virtual tables |
4444-4545-### what's already decoupled
4646-4747-- result types (`SearchResultJson`, `Doc`, `Pub`)
4848-- similarity search (uses `vector_distance_cos`, not FTS5)
4949-- caching logic
5050-- HTTP layer (server.zig just calls `search()`)
5151-5252-### known limitations
5353-5454-- **no typo tolerance**: "leafet" won't find "leaflet"
5555-- **no relevance tuning**: can't boost title vs content
5656-- **single writer**: SQLite write lock
5757-- **no horizontal scaling**: single database
5858-5959-these aren't problems at current scale.
6060-6161-## future: if we need to scale
6262-6363-### when to consider switching
6464-6565-- search latency consistently >100ms
6666-- write contention from indexing
6767-- need typo tolerance or better relevance
6868-- millions of documents
6969-7070-### recommended: Elasticsearch
7171-7272-Elasticsearch is the battle-tested choice for production search:
7373-7474-- proven at massive scale (Wikipedia, GitHub, Stack Overflow)
7575-- rich query DSL, analyzers, aggregations
7676-- typo tolerance via fuzzy matching
7777-- horizontal scaling built-in
7878-- extensive tooling and community
7979-8080-trade-offs:
8181-- operational complexity (JVM, cluster management)
8282-- resource hungry (~2GB+ RAM minimum)
8383-- cost: $50-500/month depending on scale
8484-8585-### alternatives considered
8686-8787-**Meilisearch/Typesense**: simpler, lighter, great defaults. good for straightforward search but less proven at scale. would work fine for this use case but Elasticsearch has more headroom.
8888-8989-**Algolia**: fully managed, excellent but expensive. makes sense if you want zero ops.
9090-9191-**PostgreSQL full-text**: if already on Postgres. not as good as FTS5 or Elasticsearch but one less system.
9292-9393-### migration path
9494-9595-1. keep Turso as source of truth
9696-2. add Elasticsearch as search index
9797-3. sync documents to ES on write (async)
9898-4. point `/search` at Elasticsearch
9999-5. keep `/similar` on Turso (vector search)
100100-101101-the `search()` function would change from SQL queries to ES client calls. result types stay the same. HTTP layer unchanged.
102102-103103-estimated effort: 1-2 days to swap search backend.
104104-105105-### vector search scaling
106106-107107-similarity search currently uses brute-force `vector_distance_cos` with caching. at scale:
108108-109109-- **Elasticsearch**: has vector search (dense_vector + kNN)
110110-- **dedicated vector DB**: Qdrant, Pinecone, Weaviate
111111-- **pgvector**: if on Postgres
112112-113113-could consolidate text + vector in Elasticsearch, or keep them separate.
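
for intuition, the current brute-force approach boils down to comparing the source document against every other embedding - a minimal python sketch (the backend does this in SQL via `vector_distance_cos`, the names here are illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def find_similar(source_uri: str, embeddings: dict[str, list[float]], k: int = 10):
    """brute force: compare the source against every other embedding - O(n) per
    query, which is why results are cached and why an ANN index becomes
    attractive at larger scale."""
    src = embeddings[source_uri]
    scored = [(uri, cosine_distance(src, vec))
              for uri, vec in embeddings.items() if uri != source_uri]
    return sorted(scored, key=lambda t: t[1])[:k]
```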
114114-115115-## summary
116116-117117-| scale | recommendation |
118118-|-------|----------------|
119119-| <10k docs | keep FTS5 (current) |
120120-| 10k-100k docs | still probably fine, monitor latency |
121121-| 100k+ docs | consider Elasticsearch |
122122-| millions + sub-ms latency | Elasticsearch cluster + caching layer |
123123-124124-we're in the "keep FTS5" zone. the code is structured to swap later if needed.
+343
docs/standard-search-planning.md
···11+# standard-search planning
22+33+expanding leaflet-search to index all standard.site records.
44+55+## references
66+77+- [standard.site](https://standard.site/) - shared lexicons for long-form publishing on ATProto
88+- [leaflet.pub](https://leaflet.pub/) - implements `pub.leaflet.*` lexicons
99+- [pckt.blog](https://pckt.blog/) - implements `blog.pckt.*` lexicons
1010+- [offprint.app](https://offprint.app/) - implements `app.offprint.*` lexicons (early beta)
1111+- [ATProto docs](https://atproto.com/docs) - protocol documentation
1212+1313+## context
1414+1515+discussion with pckt.blog team about building global search for standard.site ecosystem.
1616+current leaflet-search is tightly coupled to `pub.leaflet.*` lexicons.
1717+1818+### recent work (2026-01-05)
1919+2020+added similarity cache to improve `/similar` endpoint performance:
2121+- `similarity_cache` table stores computed results keyed by `(source_uri, doc_count)`
2222+- cache auto-invalidates when document count changes
2323+- `/stats` endpoint now shows `cache_hits` and `cache_misses`
2424+- first request ~3s (cold), cached requests ~0.15s
2525+2626+also added loading indicator for "related to" results in frontend.
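
a minimal sketch of the cache lookup described above (python/sqlite for illustration; the real backend is zig + turso, and the table/column names are an approximation):

```python
import json
import sqlite3

def similar_with_cache(db: sqlite3.Connection, source_uri: str, compute):
    """cache keyed by (source_uri, doc_count): when the document count changes,
    old keys simply stop matching, so stale results are never served."""
    (doc_count,) = db.execute("SELECT COUNT(*) FROM documents").fetchone()
    row = db.execute(
        "SELECT results FROM similarity_cache WHERE source_uri = ? AND doc_count = ?",
        (source_uri, doc_count),
    ).fetchone()
    if row:
        return json.loads(row[0])                  # warm path (~0.15s)
    results = compute(source_uri)                  # cold path (~3s)
    db.execute(
        "INSERT INTO similarity_cache (source_uri, doc_count, results) VALUES (?, ?, ?)",
        (source_uri, doc_count, json.dumps(results)),
    )
    db.commit()
    return results
```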
2727+2828+### recent work (2026-01-06)
2929+3030+- merged PR1: multi-platform schema (platform + source_collection columns)
3131+- added `loading.js` - portable loading state handler for dashboards
3232+ - skeleton shimmer while loading
3333+ - "waking up" toast after 2s threshold (fly.io cold start handling)
3434+ - designed to be copied to other projects
3535+- fixed pluralization ("1 result" vs "2 results")
3636+3737+## what we know
3838+3939+### standard.site lexicons
4040+4141+two shared lexicons for long-form publishing on ATProto:
4242+- `site.standard.document` - document content and metadata
4343+- `site.standard.publication` - publication/blog metadata
4444+4545+implementing platforms:
4646+- leaflet.pub (`pub.leaflet.*`)
4747+- pckt.blog (`blog.pckt.*`)
4848+- offprint.app (`app.offprint.*`)
4949+5050+### site.standard.document schema
5151+5252+examined real records from pckt.blog. key fields:
5353+5454+```
5555+textContent - PRE-FLATTENED TEXT FOR SEARCH (the holy grail)
5656+content - platform-specific block structure
5757+ .$type - identifies platform (e.g., "blog.pckt.content")
5858+title - document title
5959+tags - array of strings
6060+site - AT-URI reference to site.standard.publication
6161+path - URL path (e.g., "/my-post-abc123")
6262+publishedAt - ISO timestamp
6363+updatedAt - ISO timestamp
6464+coverImage - blob reference
6565+```
6666+6767+### the textContent field
6868+6969+this is huge. platforms flatten their block content into a single text field:
7070+7171+```json
7272+{
7373+ "content": {
7474+ "$type": "blog.pckt.content",
7575+ "items": [ /* platform-specific blocks */ ]
7676+ },
7777+ "textContent": "i have been writing a lot of atproto things in zig!..."
7878+}
7979+```
8080+8181+no need to parse platform-specific blocks - just index `textContent` directly.
8282+8383+### platform detection
8484+8585+derive platform from `content.$type` prefix:
8686+- `blog.pckt.content` → pckt
8787+- `pub.leaflet.content` → leaflet (TBD - need to verify)
8888+- `app.offprint.content` → offprint (TBD - need to verify)
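
a small sketch of that mapping (python; the real detection would live in the zig extractor, and the leaflet/offprint prefixes above are still unverified):

```python
def detect_platform(content_type: str) -> str:
    """map content.$type prefixes to platform names; unknown prefixes fall back to 'other'."""
    prefixes = {
        "blog.pckt.": "pckt",
        "pub.leaflet.": "leaflet",    # unverified
        "app.offprint.": "offprint",  # unverified
    }
    for prefix, platform in prefixes.items():
        if content_type.startswith(prefix):
            return platform
    return "other"

print(detect_platform("blog.pckt.content"))  # -> "pckt"
```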
8989+9090+### current leaflet-search architecture
9191+9292+```
9393+ATProto firehose (via tap)
9494+ ↓
9595+tap.zig - subscribes to pub.leaflet.document/publication
9696+ ↓
9797+indexer.zig - extracts content from nested pages[].blocks[] structure
9898+ ↓
9999+turso (sqlite) - documents table + FTS5 + embeddings
100100+ ↓
101101+search.zig - FTS5 queries + vector similarity
102102+ ↓
103103+server.zig - HTTP API (/search, /similar, /stats)
104104+```
105105+106106+leaflet-specific code:
107107+- tap.zig lines 10-11: hardcoded collection names
108108+- tap.zig lines 234-268: block type extraction (pub.leaflet.blocks.*)
109109+- recursive page/block traversal logic
110110+111111+generalizable code:
112112+- database schema (FTS5, tags, stats, similarity cache)
113113+- search/similar logic
114114+- HTTP API
115115+- embedding pipeline
116116+117117+## proposed architecture for standard-search
118118+119119+### ingestion changes
120120+121121+subscribe to:
122122+- `site.standard.document`
123123+- `site.standard.publication`
124124+125125+optionally also subscribe to platform-specific collections for richer data:
126126+- `pub.leaflet.document/publication`
127127+- `blog.pckt.document/publication` (if they have these)
128128+- `app.offprint.document/publication` (if they have these)
129129+130130+### content extraction
131131+132132+for `site.standard.document`:
133133+1. use `textContent` field directly - no block parsing!
134134+2. fall back to title + description if textContent missing
135135+136136+for platform-specific records (if needed):
137137+- keep existing leaflet block parser
138138+- add parsers for other platforms as needed
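
a sketch of the standard.site extraction path described above (python for illustration; the real code would be `extractor.zig`, and `description` as the fallback field is taken from the note above, not a verified lexicon field):

```python
def extract_text(record: dict) -> str:
    """site.standard.document: prefer the pre-flattened textContent field,
    fall back to title + description when it is missing."""
    text = record.get("textContent")
    if text:
        return text
    return " ".join(
        part for part in (record.get("title"), record.get("description")) if part
    )
```

pub.leaflet.document records keep going through the existing pages[].blocks[] parser.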
139139+140140+### database changes
141141+142142+add to documents table:
143143+- `platform` TEXT - derived from content.$type (leaflet, pckt, offprint)
144144+- `source_collection` TEXT - the actual lexicon (site.standard.document, pub.leaflet.document)
145145+- `standard_uri` TEXT - if platform-specific record, link to corresponding site.standard.document
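
a sketch of what that migration amounts to (python/sqlite here for illustration; the real database is turso/libSQL, which accepts the same DDL):

```python
import sqlite3

MIGRATION = """
ALTER TABLE documents ADD COLUMN platform TEXT DEFAULT 'leaflet';
ALTER TABLE documents ADD COLUMN source_collection TEXT DEFAULT 'pub.leaflet.document';
ALTER TABLE documents ADD COLUMN standard_uri TEXT;
"""

def migrate(db_path: str) -> None:
    # constant defaults mean existing leaflet rows read back as
    # platform='leaflet' without an explicit backfill UPDATE
    with sqlite3.connect(db_path) as db:
        db.executescript(MIGRATION)
```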
146146+147147+### API changes
148148+149149+- `/search?q=...&platform=leaflet` - optional platform filter
150150+- results include `platform` field
151151+- `/similar` works across all platforms
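
illustrative usage once the filter lands (python/httpx against the existing backend URL; the `platform` param is the proposal above, not yet deployed):

```python
import httpx

# the /search endpoint already exists; the platform param is the new piece
resp = httpx.get(
    "https://leaflet-search-backend.fly.dev/search",
    params={"q": "zig", "platform": "pckt"},
    timeout=30.0,
)
for result in resp.json():
    print(result.get("platform"), result["title"])
```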
152152+153153+### naming/deployment
154154+155155+options:
156156+1. rename leaflet-search → standard-search (breaking change)
157157+2. new repo/deployment, keep leaflet-search as-is
158158+3. branch and generalize, decide naming later
159159+160160+leaning toward option 3 for now.
161161+162162+## findings from exploration
163163+164164+### pckt.blog - READY
165165+- writes `site.standard.document` records
166166+- has `textContent` field (pre-flattened)
167167+- `content.$type` = `blog.pckt.content`
168168+- 6+ records found on pckt.blog service account
169169+170170+### leaflet.pub - NOT YET MIGRATED
171171+- still using `pub.leaflet.document` only
172172+- no `site.standard.document` records found
173173+- no `textContent` field - content is in nested `pages[].blocks[]`
174174+- will need to continue parsing blocks OR wait for migration
175175+176176+### offprint.app - LIKELY EARLY BETA
177177+- no `site.standard.document` records found on offprint.app account
178178+- no `app.offprint.document` collection visible
179179+- website shows no example users/content
180180+- probably in early/private beta - no public records yet
181181+182182+### implication for architecture
183183+184184+two paths:
185185+186186+**path A: wait for leaflet migration**
187187+- simpler: just index `site.standard.document` with `textContent`
188188+- all platforms converge on same schema
189189+- downside: loses existing leaflet search until they migrate
190190+191191+**path B: hybrid approach**
192192+- index `site.standard.document` (pckt, future leaflet, offprint)
193193+- ALSO index `pub.leaflet.document` with existing block parser
194194+- dedupe by URI or store both with `source_collection` indicator
195195+- more complex but maintains backwards compat
196196+197197+leaning toward **path B** - can't lose 3500 leaflet docs.
198198+199199+## open questions
200200+201201+- [x] does leaflet write site.standard.document records? **NO, not yet**
202202+- [x] does offprint write site.standard.document records? **UNKNOWN - no public content yet**
203203+- [ ] when will leaflet migrate to standard.site?
204204+- [ ] should we dedupe platform-specific vs standard records?
205205+- [ ] embeddings: regenerate for all, or use same model?
206206+207207+## implementation plan (PRs)
208208+209209+breaking work into reviewable chunks:
211211+### PR1: database schema for multi-platform ✅ MERGED
212212+- add `platform TEXT` column to documents (default 'leaflet')
213213+- add `source_collection TEXT` column (default 'pub.leaflet.document')
214214+- backfill existing ~3500 records
215215+- no behavior change, just schema prep
216216+- https://github.com/zzstoatzz/leaflet-search/pull/1
217217+218218+### PR2: generalized content extraction
219219+- new `extractor.zig` module with platform-agnostic interface
220220+- `textContent` extraction for standard.site records
221221+- keep existing block parser for `pub.leaflet.*`
222222+- platform detection from `content.$type`
223223+224224+### PR3: TAP subscriber for site.standard.document
225225+- subscribe to `site.standard.document` + `site.standard.publication`
226226+- route to appropriate extractor
227227+- starts ingesting pckt.blog content
228228+229229+### PR4: API platform filter
230230+- add `?platform=` query param to `/search`
231231+- include `platform` field in results
232232+- frontend: show platform badge, optional filter
233233+234234+### PR5 (optional, separate track): witness cache
235235+- `witness_cache` table for raw records
236236+- replay tooling for backfills
237237+- independent of above work
238238+239239+## operational notes
240240+241241+- **cloudflare pages**: `leaflet-search` does NOT auto-deploy from git. manual deploy required:
242242+ ```bash
243243+ wrangler pages deploy site --project-name leaflet-search
244244+ ```
245245+- **fly.io backend**: deploy from backend directory:
246246+ ```bash
247247+ cd backend && fly deploy
248248+ ```
249249+- **git remotes**: push to both `origin` (tangled.sh) and `github` (for MCP + PRs)
250250+251251+## next steps
252252+253253+1. ~~verify leaflet's site.standard.document structure~~ (done - they don't have any)
254254+2. ~~find and examine offprint records~~ (done - no public content yet)
255255+3. ~~PR1: database schema~~ (merged)
256256+4. PR2: generalized content extraction
257257+5. PR3: TAP subscriber
258258+6. PR4: API platform filter
259259+7. consider witness cache architecture (see below)
260260+261261+---
262262+263263+## architectural consideration: witness cache
264264+265265+[paul frazee's post on witness caches](https://bsky.app/profile/pfrazee.com/post/3lfarplxvcs2e) (2026-01-05):
266266+267267+> I'm increasingly convinced that many Atmosphere backends start with a local "witness cache" of the repositories.
268268+>
269269+> A witness cache is a copy of the repository records, plus a timestamp of when the record was indexed (the "witness time") which you want to keep
270270+>
271271+> The key feature is: you can replay it
272272+273273+> With local replay, you can add new tables or indexes to your backend and quickly backfill the data. If you don't have a witness cache, you would have to do backfill from the network, which is slow
274274+275275+### current leaflet-search architecture (no witness cache)
276276+277277+```
278278+Firehose → TAP → Parse & Transform → Store DERIVED data → Discard raw record
279279+```
280280+281281+we store:
282282+- `uri`, `did`, `rkey`
283283+- `title` (extracted)
284284+- `content` (flattened from blocks)
285285+- `created_at`, `publication_uri`
286286+287287+we discard: the raw record JSON
288288+289289+### witness cache architecture
290290+291291+```
292292+Firehose → Store RAW record + witness_time → Derive indexes on demand (replayable)
293293+```
294294+295295+would store:
296296+- `uri`, `collection`, `rkey`
297297+- `raw_record` (full JSON blob)
298298+- `witness_time` (when we indexed it)
299299+300300+then derive FTS, embeddings, etc. from local data via replay.
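
a sketch of that storage plus replay (python/sqlite; the columns follow the "would store" list above, everything else is illustrative):

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS witness_cache (
    uri          TEXT PRIMARY KEY,
    collection   TEXT NOT NULL,
    raw_record   TEXT NOT NULL,   -- full JSON blob as witnessed
    witness_time TEXT NOT NULL    -- when we first indexed the record
);
"""

def replay(db: sqlite3.Connection, derive) -> None:
    """re-derive indexes (FTS rows, embeddings, new columns) from local data
    instead of re-crawling the network."""
    for uri, collection, raw in db.execute(
        "SELECT uri, collection, raw_record FROM witness_cache"
    ):
        derive(uri, collection, json.loads(raw))
```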
301301+302302+### comparison
303303+304304+| scenario | current (no cache) | with witness cache |
305305+|----------|-------------------|-------------------|
306306+| add new parser (offprint) | re-crawl network | replay local |
307307+| leaflet adds textContent | wait for new records | replay & re-extract |
308308+| fix parsing bug | re-crawl affected | replay & re-derive |
309309+| change embedding model | re-fetch content | replay local |
310310+| add new index/table | backfill from network | replay locally |
311311+312312+### trade-offs
313313+314314+**storage cost:**
315315+- ~3500 docs × ~10KB avg = ~35MB (not huge)
316316+- turso free tier: 9GB, so plenty of room
317317+318318+**complexity:**
319319+- two-phase: store raw, then derive
320320+- vs current one-phase: derive immediately
321321+322322+**benefits for standard-search:**
323323+- could add offprint/pckt parsers and replay existing data
324324+- when leaflet migrates to standard.site, re-derive without network
325325+- embedding backfill becomes local-only: content comes from the cache instead of being re-fetched from the network (only the voyage API call for embeddings remains)
326326+327327+### implementation options
328328+329329+1. **add `raw_record TEXT` column to existing tables**
330330+ - simple, backwards compatible
331331+ - can migrate incrementally
332332+333333+2. **separate `witness_cache` table**
334334+ - `(uri PRIMARY KEY, collection, raw_record, witness_time)`
335335+ - cleaner separation of concerns
336336+ - documents/publications tables become derived views
337337+338338+3. **use duckdb/clickhouse for witness cache** (paul's suggestion)
339339+ - better compression for JSON blobs
340340+ - good for analytics queries
341341+ - adds operational complexity
342342+343343+for our scale, option 1 or 2 with turso is probably fine.
-215
docs/tap.md
···11-# tap (firehose sync)
22-33-leaflet-search uses [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) from bluesky-social/indigo to receive real-time events from the ATProto firehose.
44-55-## what is tap?
66-77-tap subscribes to the ATProto firehose, filters for specific collections (e.g., `site.standard.document`), and broadcasts matching events to websocket clients. it also does initial crawling/backfilling of existing records.
88-99-key behavior: **tap backfills historical data when repos are added**. when a repo is added to tracking:
1010-1. tap fetches the full repo from the account's PDS using `com.atproto.sync.getRepo`
1111-2. live firehose events during backfill are buffered in memory
1212-3. historical events (marked `live: false`) are delivered first
1313-4. after historical events complete, buffered live events are released
1414-5. subsequent firehose events arrive immediately marked as `live: true`
1515-1616-tap enforces strict per-repo ordering - live events are synchronization barriers that require all prior events to complete first.
1717-1818-## message format
1919-2020-tap sends JSON messages over websocket. record events look like:
2121-2222-```json
2323-{
2424- "type": "record",
2525- "record": {
2626- "live": true,
2727- "did": "did:plc:abc123...",
2828- "rev": "3mbspmpaidl2a",
2929- "collection": "site.standard.document",
3030- "rkey": "3lzyrj6q6gs27",
3131- "action": "create",
3232- "record": { ... },
3333- "cid": "bafyrei..."
3434- }
3535-}
3636-```
3737-3838-### field types (important!)
3939-4040-| field | type | values | notes |
4141-|-------|------|--------|-------|
4242-| type | string | "record", "identity", "account" | message type |
4343-| action | **string** | "create", "update", "delete" | NOT an enum! |
4444-| live | bool | true/false | true = firehose, false = resync |
4545-| collection | string | e.g., "site.standard.document" | lexicon collection |
4646-4747-## gotchas
4848-4949-1. **action is a string, not an enum** - tap sends `"action": "create"` as a JSON string. if your parser expects an enum type, extraction will silently fail. use string comparison.
5050-5151-2. **collection filters apply during processing** - `TAP_COLLECTION_FILTERS` controls which records tap processes and sends to clients, both during live commits and resync CAR walks. records from other collections are skipped entirely.
5252-5353-3. **signal collection vs collection filters** - `TAP_SIGNAL_COLLECTION` controls auto-discovery of repos (which repos to track), while `TAP_COLLECTION_FILTERS` controls which records from those repos to output. a repo must either be auto-discovered via signal collection OR manually added via `/repos/add`.
5454-5555-4. **silent extraction failures** - if using zat's `extractAt`, enable debug logging to see why parsing fails:
5656- ```zig
5757- pub const std_options = .{
5858- .log_scope_levels = &.{.{ .scope = .zat, .level = .debug }},
5959- };
6060- ```
6161- this will show messages like:
6262- ```
6363- debug(zat): extractAt: parse failed for Op at path { "op" }: InvalidEnumTag
6464- ```
6565-6666-## memory and performance tuning
6767-6868-tap loads **entire repo CARs into memory** during resync. some bsky users have repos that are 100-300MB+. this causes spiky memory usage that can OOM the machine.
6969-7070-### recommended settings for leaflet-search
7171-7272-```toml
7373-[[vm]]
7474- memory = '2gb' # 1gb is not enough
7575-7676-[env]
7777- TAP_RESYNC_PARALLELISM = '1' # only one repo CAR in memory at a time (default: 5)
7878- TAP_FIREHOSE_PARALLELISM = '5' # concurrent event processors (default: 10)
7979- TAP_OUTBOX_CAPACITY = '10000' # event buffer size (default: 100000)
8080- TAP_IDENT_CACHE_SIZE = '10000' # identity cache entries (default: 2000000)
8181-```
8282-8383-### why these values?
8484-8585-- **2GB memory**: 1GB causes OOM kills when resyncing large repos
8686-- **resync parallelism 1**: prevents multiple large CARs in memory simultaneously
8787-- **lower firehose/outbox**: we track ~1000 repos, not millions - defaults are overkill
8888-- **smaller ident cache**: we don't need 2M cached identities
8989-9090-if tap keeps OOM'ing, check logs for large repo resyncs:
9191-```bash
9292-fly logs -a leaflet-search-tap | grep "parsing repo CAR" | grep -E "size\":[0-9]{8,}"
9393-```
9494-9595-## quick status check
9696-9797-from the `tap/` directory:
9898-```bash
9999-just check
100100-```
101101-102102-shows tap machine state, most recent indexed date, and 7-day timeline. useful for verifying indexing is working after restarts.
103103-104104-example output:
105105-```
106106-=== tap status ===
107107-app 781417db604d48 23 ewr started ...
108108-109109-=== Recent Indexing Activity ===
110110-Last indexed: 2026-01-08 (14 docs)
111111-Today: 2026-01-11
112112-Docs: 3742 | Pubs: 1231
113113-114114-=== Timeline (last 7 days) ===
115115-2026-01-08: 14 docs
116116-2026-01-07: 29 docs
117117-...
118118-```
119119-120120-if "Last indexed" is more than a day behind "Today", tap may be down or catching up.
121121-122122-## checking catch-up progress
123123-124124-when tap restarts after downtime, it replays the firehose from its saved cursor. to check progress:
125125-126126-```bash
127127-# see current firehose position (look for timestamps in log messages)
128128-fly logs -a leaflet-search-tap | grep -E '"time".*"seq"' | tail -3
129129-```
130130-131131-the `"time"` field in log messages shows how far behind tap is. compare to current time to estimate catch-up.
132132-133133-catch-up speed varies:
134134-- **~0.3x** when resync queue is full (large repos being fetched)
135135-- **~1x or faster** once resyncs clear
136136-137137-## debugging
138138-139139-### check tap connection
140140-```bash
141141-fly logs -a leaflet-search-tap --no-tail | tail -30
142142-```
143143-144144-look for:
145145-- `"connected to firehose"` - successfully connected to bsky relay
146146-- `"websocket connected"` - backend connected to tap
147147-- `"dialing failed"` / `"i/o timeout"` - network issues
148148-149149-### check backend is receiving
150150-```bash
151151-fly logs -a leaflet-search-backend --no-tail | grep -E "(tap|indexed)"
152152-```
153153-154154-look for:
155155-- `tap connected!` - connected to tap
156156-- `tap: msg_type=record` - receiving messages
157157-- `indexed document:` - successfully processing
158158-159159-### common issues
160160-161161-| symptom | cause | fix |
162162-|---------|-------|-----|
163163-| tap machine stopped, `oom_killed=true` | large repo CARs exhausted memory | increase memory to 2GB, reduce `TAP_RESYNC_PARALLELISM` to 1 |
164164-| `websocket handshake failed: error.Timeout` | tap not running or network issue | restart tap, check regions match |
165165-| `dialing failed: lookup ... i/o timeout` | DNS issues reaching bsky relay | restart tap, transient network issue |
166166-| messages received but not indexed | extraction failing (type mismatch) | enable zat debug logging, check field types |
167167-| repo shows `records: 0` after adding | resync failed or collection not in filters | check tap logs for resync errors, verify `TAP_COLLECTION_FILTERS` |
168168-| new platform records not appearing | platform's collection not in `TAP_COLLECTION_FILTERS` | add collection to filters, restart tap |
169169-| indexing stopped, tap shows "started" | tap catching up from downtime | check firehose position in logs, wait for catch-up |
170170-171171-## tap API endpoints
172172-173173-tap exposes HTTP endpoints for monitoring and control:
174174-175175-| endpoint | description |
176176-|----------|-------------|
177177-| `/health` | health check |
178178-| `/stats/repo-count` | number of tracked repos |
179179-| `/stats/record-count` | total records processed |
180180-| `/stats/outbox-buffer` | events waiting to be sent |
181181-| `/stats/resync-buffer` | buffered commits for repos currently resyncing (NOT the resync queue) |
182182-| `/stats/cursors` | firehose cursor position |
183183-| `/info/:did` | repo status: `{"did":"...","state":"active","records":N}` |
184184-| `/repos/add` | POST with `{"dids":["did:plc:..."]}` to add repos |
185185-| `/repos/remove` | POST with `{"dids":["did:plc:..."]}` to remove repos |
186186-187187-example: check repo status
188188-```bash
189189-fly ssh console -a leaflet-search-tap -C "curl -s localhost:2480/info/did:plc:abc123"
190190-```
191191-192192-example: manually add a repo for backfill
193193-```bash
194194-fly ssh console -a leaflet-search-tap -C 'curl -X POST -H "Content-Type: application/json" -d "{\"dids\":[\"did:plc:abc123\"]}" localhost:2480/repos/add'
195195-```
196196-197197-## fly.io deployment
198198-199199-both tap and backend should be in the same region for internal networking:
200200-201201-```bash
202202-# check current regions
203203-fly status -a leaflet-search-tap
204204-fly status -a leaflet-search-backend
205205-206206-# restart tap if needed
207207-fly machine restart -a leaflet-search-tap <machine-id>
208208-```
209209-210210-note: changing `primary_region` in fly.toml only affects new machines. to move existing machines, clone to new region and destroy old one.
211211-212212-## references
213213-214214-- [tap source (bluesky-social/indigo)](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)
215215-- [ATProto firehose docs](https://atproto.com/specs/sync#firehose)
+5-5
mcp/README.md
···11-# pub search MCP
11+# leaflet-mcp
2233-MCP server for [pub search](https://pub-search.waow.tech) - search ATProto publishing platforms (Leaflet, pckt, standard.site).
33+MCP server for [Leaflet](https://leaflet.pub) - search decentralized publications on ATProto.
4455## usage
6677### hosted (recommended)
8899```bash
1010-claude mcp add-json pub-search '{"type": "http", "url": "https://pub-search-by-zzstoatzz.fastmcp.app/mcp"}'
1010+claude mcp add-json leaflet '{"type": "http", "url": "https://leaflet-search-by-zzstoatzz.fastmcp.app/mcp"}'
1111```
12121313### local
···1515run the MCP server locally with `uvx`:
16161717```bash
1818-uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp pub-search
1818+uvx --from git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp leaflet-mcp
1919```
20202121to add it to claude code as a local stdio server:
22222323```bash
2424-claude mcp add pub-search -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' pub-search
2424+claude mcp add leaflet -- uvx --from 'git+https://github.com/zzstoatzz/leaflet-search#subdirectory=mcp' leaflet-mcp
2525```
26262727## workflow
···11-#!/usr/bin/env python3
22-"""Test the pub-search MCP server."""
33-44-import asyncio
55-import sys
66-77-from fastmcp import Client
88-from fastmcp.client.transports import FastMCPTransport
99-1010-from pub_search.server import mcp
1111-1212-1313-async def main():
1414- # use local transport for testing, or live URL if --live flag
1515- if "--live" in sys.argv:
1616- print("testing against live Horizon server...")
1717- client = Client("https://pub-search-by-zzstoatzz.fastmcp.app/mcp")
1818- else:
1919- print("testing locally with FastMCPTransport...")
2020- client = Client(transport=FastMCPTransport(mcp))
2121-2222- async with client:
2323- # list tools
2424- print("=== tools ===")
2525- tools = await client.list_tools()
2626- for t in tools:
2727- print(f" {t.name}")
2828-2929- # test search with new platform filter
3030- print("\n=== search(query='zig', platform='leaflet', limit=3) ===")
3131- result = await client.call_tool(
3232- "search", {"query": "zig", "platform": "leaflet", "limit": 3}
3333- )
3434- for item in result.content:
3535- print(f" {item.text[:200]}...")
3636-3737- # test search with since filter
3838- print("\n=== search(query='python', since='2025-01-01', limit=2) ===")
3939- result = await client.call_tool(
4040- "search", {"query": "python", "since": "2025-01-01", "limit": 2}
4141- )
4242- for item in result.content:
4343- print(f" {item.text[:200]}...")
4444-4545- # test get_tags
4646- print("\n=== get_tags() ===")
4747- result = await client.call_tool("get_tags", {})
4848- for item in result.content:
4949- print(f" {item.text[:150]}...")
5050-5151- # test get_stats
5252- print("\n=== get_stats() ===")
5353- result = await client.call_tool("get_stats", {})
5454- for item in result.content:
5555- print(f" {item.text}")
5656-5757- # test get_popular
5858- print("\n=== get_popular(limit=3) ===")
5959- result = await client.call_tool("get_popular", {"limit": 3})
6060- for item in result.content:
6161- print(f" {item.text[:100]}...")
6262-6363- print("\n=== all tests passed ===")
6464-6565-6666-if __name__ == "__main__":
6767- asyncio.run(main())
+5
mcp/src/leaflet_mcp/__init__.py
···11+"""Leaflet MCP server - search decentralized publications on ATProto."""
22+33+from leaflet_mcp.server import main, mcp
44+55+__all__ = ["main", "mcp"]
+58
mcp/src/leaflet_mcp/_types.py
···11+"""Type definitions for Leaflet MCP responses."""
22+33+from typing import Literal
44+55+from pydantic import BaseModel, computed_field
66+77+88+class SearchResult(BaseModel):
99+ """A search result from the Leaflet API."""
1010+1111+ type: Literal["article", "looseleaf", "publication"]
1212+ uri: str
1313+ did: str
1414+ title: str
1515+ snippet: str
1616+ createdAt: str = ""
1717+ rkey: str
1818+ basePath: str = ""
1919+2020+ @computed_field
2121+ @property
2222+ def url(self) -> str:
2323+ """web URL for this document."""
2424+ if self.basePath:
2525+ return f"https://{self.basePath}/{self.rkey}"
2626+ return ""
2727+2828+2929+class Tag(BaseModel):
3030+ """A tag with document count."""
3131+3232+ tag: str
3333+ count: int
3434+3535+3636+class PopularSearch(BaseModel):
3737+ """A popular search query with count."""
3838+3939+ query: str
4040+ count: int
4141+4242+4343+class Stats(BaseModel):
4444+ """Leaflet index statistics."""
4545+4646+ documents: int
4747+ publications: int
4848+4949+5050+class Document(BaseModel):
5151+ """Full document content from ATProto."""
5252+5353+ uri: str
5454+ title: str
5555+ content: str
5656+ createdAt: str = ""
5757+ tags: list[str] = []
5858+ publicationUri: str = ""
+21
mcp/src/leaflet_mcp/client.py
···11+"""HTTP client for Leaflet search API."""
22+33+import os
44+from contextlib import asynccontextmanager
55+from typing import AsyncIterator
66+77+import httpx
88+99+# configurable via env var, defaults to production
1010+LEAFLET_API_URL = os.getenv("LEAFLET_API_URL", "https://leaflet-search-backend.fly.dev")
1111+1212+1313+@asynccontextmanager
1414+async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515+ """Get an async HTTP client for Leaflet API requests."""
1616+ async with httpx.AsyncClient(
1717+ base_url=LEAFLET_API_URL,
1818+ timeout=30.0,
1919+ headers={"Accept": "application/json"},
2020+ ) as client:
2121+ yield client
+289
mcp/src/leaflet_mcp/server.py
···11+"""Leaflet MCP server implementation using fastmcp."""
22+33+from __future__ import annotations
44+55+from typing import Any
66+77+from fastmcp import FastMCP
88+99+from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from leaflet_mcp.client import get_http_client
1111+1212+mcp = FastMCP("leaflet")
1313+1414+1515+# -----------------------------------------------------------------------------
1616+# prompts
1717+# -----------------------------------------------------------------------------
1818+1919+2020+@mcp.prompt("usage_guide")
2121+def usage_guide() -> str:
2222+ """instructions for using leaflet MCP tools."""
2323+ return """\
2424+# Leaflet MCP server usage guide
2525+2626+Leaflet is a decentralized publishing platform on ATProto (the protocol behind Bluesky).
2727+This MCP server provides search and discovery tools for Leaflet publications.
2828+2929+## core tools
3030+3131+- `search(query, tag)` - search documents and publications by text or tag
3232+- `get_document(uri)` - get the full content of a document by its AT-URI
3333+- `find_similar(uri)` - find documents similar to a given document
3434+- `get_tags()` - list all available tags with document counts
3535+- `get_stats()` - get index statistics (document/publication counts)
3636+- `get_popular()` - see popular search queries
3737+3838+## workflow for research
3939+4040+1. use `search("your topic")` to find relevant documents
4141+2. use `get_document(uri)` to retrieve full content of interesting results
4242+3. use `find_similar(uri)` to discover related content
4343+4444+## result types
4545+4646+search returns three types of results:
4747+- **publication**: a collection of articles (like a blog or magazine)
4848+- **article**: a document that belongs to a publication
4949+- **looseleaf**: a standalone document not part of a publication
5050+5151+## AT-URIs
5252+5353+documents are identified by AT-URIs like:
5454+ `at://did:plc:abc123/pub.leaflet.document/xyz789`
5555+5656+you can also browse documents on the web at leaflet.pub
5757+"""
5858+5959+6060+@mcp.prompt("search_tips")
6161+def search_tips() -> str:
6262+ """tips for effective searching."""
6363+ return """\
6464+# Leaflet search tips
6565+6666+## text search
6767+- searches both document titles and content
6868+- uses FTS5 full-text search with prefix matching
6969+- the last word gets prefix matching: "cat dog" matches "cat dogs"
7070+7171+## tag filtering
7272+- combine text search with tag filter: `search("python", tag="programming")`
7373+- use `get_tags()` to discover available tags
7474+- tags are only applied to documents, not publications
7575+7676+## finding related content
7777+- after finding an interesting document, use `find_similar(uri)`
7878+- similarity is based on semantic embeddings (voyage-3-lite)
7979+- great for exploring related topics
8080+8181+## browsing by popularity
8282+- use `get_popular()` to see what others are searching for
8383+- can inspire new research directions
8484+"""
8585+8686+8787+# -----------------------------------------------------------------------------
8888+# tools
8989+# -----------------------------------------------------------------------------
9090+9191+9292+@mcp.tool
9393+async def search(
9494+ query: str = "",
9595+ tag: str | None = None,
9696+ limit: int = 5,
9797+) -> list[SearchResult]:
9898+ """search leaflet documents and publications.
9999+100100+ searches the full text of documents (titles and content) and publications.
101101+ results include a snippet showing where the match was found.
102102+103103+ args:
104104+ query: search query (searches titles and content)
105105+ tag: optional tag to filter by (only applies to documents)
106106+ limit: max results to return (default 5, max 40)
107107+108108+ returns:
109109+ list of search results with uri, title, snippet, and metadata
110110+ """
111111+ if not query and not tag:
112112+ return []
113113+114114+ params: dict[str, Any] = {}
115115+ if query:
116116+ params["q"] = query
117117+ if tag:
118118+ params["tag"] = tag
119119+120120+ async with get_http_client() as client:
121121+ response = await client.get("/search", params=params)
122122+ response.raise_for_status()
123123+ results = response.json()
124124+125125+ # apply client-side limit since API returns up to 40
126126+ return [SearchResult(**r) for r in results[:limit]]
127127+128128+129129+@mcp.tool
130130+async def get_document(uri: str) -> Document:
131131+ """get the full content of a document by its AT-URI.
132132+133133+ fetches the complete document from ATProto, including full text content.
134134+ use this after finding documents via search to get the complete text.
135135+136136+ args:
137137+ uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
138138+139139+ returns:
140140+ document with full content, title, tags, and metadata
141141+ """
142142+ # use pdsx to fetch the actual record from ATProto
143143+ try:
144144+ from pdsx._internal.operations import get_record
145145+ from pdsx.mcp.client import get_atproto_client
146146+ except ImportError as e:
147147+ raise RuntimeError(
148148+ "pdsx is required for fetching full documents. install with: uv add pdsx"
149149+ ) from e
150150+151151+ # extract repo from URI for PDS discovery
152152+ # at://did:plc:xxx/collection/rkey
153153+ parts = uri.replace("at://", "").split("/")
154154+ if len(parts) < 3:
155155+ raise ValueError(f"invalid AT-URI: {uri}")
156156+157157+ repo = parts[0]
158158+159159+ async with get_atproto_client(target_repo=repo) as client:
160160+ record = await get_record(client, uri)
161161+162162+ value = record.value
163163+ # DotDict doesn't have a working .get(), convert to dict first
164164+ if hasattr(value, "to_dict") and callable(value.to_dict):
165165+ value = value.to_dict()
166166+ elif not isinstance(value, dict):
167167+ value = dict(value)
168168+169169+ # extract content from leaflet's block structure
170170+ # pages[].blocks[].block.plaintext
171171+ content_parts = []
172172+ for page in value.get("pages", []):
173173+ for block_wrapper in page.get("blocks", []):
174174+ block = block_wrapper.get("block", {})
175175+ plaintext = block.get("plaintext", "")
176176+ if plaintext:
177177+ content_parts.append(plaintext)
178178+179179+ content = "\n\n".join(content_parts)
180180+181181+ return Document(
182182+ uri=record.uri,
183183+ title=value.get("title", ""),
184184+ content=content,
185185+ createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
186186+ tags=value.get("tags", []),
187187+ publicationUri=value.get("publication", ""),
188188+ )
189189+190190+191191+@mcp.tool
192192+async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
193193+ """find documents similar to a given document.
194194+195195+ uses vector similarity (voyage-3-lite embeddings) to find semantically
196196+ related documents. great for discovering related content after finding
197197+ an interesting document.
198198+199199+ args:
200200+ uri: the AT-URI of the document to find similar content for
201201+ limit: max similar documents to return (default 5)
202202+203203+ returns:
204204+ list of similar documents with uri, title, and metadata
205205+ """
206206+ async with get_http_client() as client:
207207+ response = await client.get("/similar", params={"uri": uri})
208208+ response.raise_for_status()
209209+ results = response.json()
210210+211211+ return [SearchResult(**r) for r in results[:limit]]
212212+213213+214214+@mcp.tool
215215+async def get_tags() -> list[Tag]:
216216+ """list all available tags with document counts.
217217+218218+ returns tags sorted by document count (most popular first).
219219+ useful for discovering topics and filtering searches.
220220+221221+ returns:
222222+ list of tags with their document counts
223223+ """
224224+ async with get_http_client() as client:
225225+ response = await client.get("/tags")
226226+ response.raise_for_status()
227227+ results = response.json()
228228+229229+ return [Tag(**t) for t in results]
230230+231231+232232+@mcp.tool
233233+async def get_stats() -> Stats:
234234+ """get leaflet index statistics.
235235+236236+ returns:
237237+ document and publication counts
238238+ """
239239+ async with get_http_client() as client:
240240+ response = await client.get("/stats")
241241+ response.raise_for_status()
242242+ return Stats(**response.json())
243243+244244+245245+@mcp.tool
246246+async def get_popular(limit: int = 5) -> list[PopularSearch]:
247247+ """get popular search queries.
248248+249249+ see what others are searching for on leaflet.
250250+ can inspire new research directions.
251251+252252+ args:
253253+ limit: max queries to return (default 5)
254254+255255+ returns:
256256+ list of popular queries with search counts
257257+ """
258258+ async with get_http_client() as client:
259259+ response = await client.get("/popular")
260260+ response.raise_for_status()
261261+ results = response.json()
262262+263263+ return [PopularSearch(**p) for p in results[:limit]]
264264+265265+266266+# -----------------------------------------------------------------------------
267267+# resources
268268+# -----------------------------------------------------------------------------
269269+270270+271271+@mcp.resource("leaflet://stats")
272272+async def stats_resource() -> str:
273273+ """current leaflet index statistics."""
274274+ stats = await get_stats()
275275+ return f"Leaflet index: {stats.documents} documents, {stats.publications} publications"
276276+277277+278278+# -----------------------------------------------------------------------------
279279+# entrypoint
280280+# -----------------------------------------------------------------------------
281281+282282+283283+def main() -> None:
284284+ """run the MCP server."""
285285+ mcp.run()
286286+287287+288288+if __name__ == "__main__":
289289+ main()
-5
mcp/src/pub_search/__init__.py
···11-"""MCP server for searching ATProto publishing platforms."""
22-33-from pub_search.server import main, mcp
44-55-__all__ = ["main", "mcp"]
-59
mcp/src/pub_search/_types.py
···11-"""Type definitions for Leaflet MCP responses."""
22-33-from typing import Literal
44-55-from pydantic import BaseModel, computed_field
66-77-88-class SearchResult(BaseModel):
99- """A search result from the Leaflet API."""
1010-1111- type: Literal["article", "looseleaf", "publication"]
1212- uri: str
1313- did: str
1414- title: str
1515- snippet: str
1616- createdAt: str = ""
1717- rkey: str
1818- basePath: str = ""
1919- platform: Literal["leaflet", "pckt", "offprint", "greengale", "other"] = "leaflet"
2020-2121- @computed_field
2222- @property
2323- def url(self) -> str:
2424- """web URL for this document."""
2525- if self.basePath:
2626- return f"https://{self.basePath}/{self.rkey}"
2727- return ""
2828-2929-3030-class Tag(BaseModel):
3131- """A tag with document count."""
3232-3333- tag: str
3434- count: int
3535-3636-3737-class PopularSearch(BaseModel):
3838- """A popular search query with count."""
3939-4040- query: str
4141- count: int
4242-4343-4444-class Stats(BaseModel):
4545- """Leaflet index statistics."""
4646-4747- documents: int
4848- publications: int
4949-5050-5151-class Document(BaseModel):
5252- """Full document content from ATProto."""
5353-5454- uri: str
5555- title: str
5656- content: str
5757- createdAt: str = ""
5858- tags: list[str] = []
5959- publicationUri: str = ""
-21
mcp/src/pub_search/client.py
···11-"""HTTP client for leaflet-search API."""
22-33-import os
44-from contextlib import asynccontextmanager
55-from typing import AsyncIterator
66-77-import httpx
88-99-# configurable via env var, defaults to production
1010-API_URL = os.getenv("LEAFLET_SEARCH_API_URL", "https://leaflet-search-backend.fly.dev")
1111-1212-1313-@asynccontextmanager
1414-async def get_http_client() -> AsyncIterator[httpx.AsyncClient]:
1515- """Get an async HTTP client for API requests."""
1616- async with httpx.AsyncClient(
1717- base_url=API_URL,
1818- timeout=30.0,
1919- headers={"Accept": "application/json"},
2020- ) as client:
2121- yield client
-276
mcp/src/pub_search/server.py
···11-"""MCP server for searching ATProto publishing platforms."""
22-33-from __future__ import annotations
44-55-from typing import Any, Literal
66-77-from fastmcp import FastMCP
88-99-from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from pub_search.client import get_http_client
1111-1212-mcp = FastMCP("pub-search")
1313-1414-1515-# -----------------------------------------------------------------------------
1616-# prompts
1717-# -----------------------------------------------------------------------------
1818-1919-2020-@mcp.prompt("usage_guide")
2121-def usage_guide() -> str:
2222- """instructions for using pub-search MCP tools."""
2323- return """\
2424-# pub-search MCP
2525-2626-search ATProto publishing platforms: leaflet, pckt, offprint, greengale.
2727-2828-## tools
2929-3030-- `search(query, tag, platform, since)` - full-text search with filters
3131-- `get_document(uri)` - fetch full content by AT-URI
3232-- `find_similar(uri)` - semantic similarity search
3333-- `get_tags()` - available tags
3434-- `get_stats()` - index statistics
3535-- `get_popular()` - popular queries
3636-3737-## workflow
3838-3939-1. `search("topic")` or `search("topic", platform="leaflet")`
4040-2. `get_document(uri)` for full text
4141-3. `find_similar(uri)` for related content
4242-4343-## result types
4444-4545-- **article**: document in a publication
4646-- **looseleaf**: standalone document
4747-- **publication**: the publication itself
4848-4949-results include a `url` field for web access.
5050-"""
5151-5252-5353-@mcp.prompt("search_tips")
5454-def search_tips() -> str:
5555- """tips for effective searching."""
5656- return """\
5757-# search tips
5858-5959-- prefix matching on last word: "cat dog" matches "cat dogs"
6060-- combine filters: `search("python", tag="tutorial", platform="leaflet")`
6161-- use `since="2025-01-01"` for recent content
6262-- `find_similar(uri)` for semantic similarity (voyage-3-lite embeddings)
6363-- `get_tags()` to discover available tags
6464-"""
6565-6666-6767-# -----------------------------------------------------------------------------
6868-# tools
6969-# -----------------------------------------------------------------------------
7070-7171-7272-Platform = Literal["leaflet", "pckt", "offprint", "greengale", "other"]
7373-7474-7575-@mcp.tool
7676-async def search(
7777- query: str = "",
7878- tag: str | None = None,
7979- platform: Platform | None = None,
8080- since: str | None = None,
8181- limit: int = 5,
8282-) -> list[SearchResult]:
8383- """search documents and publications.
8484-8585- args:
8686- query: search query (titles and content)
8787- tag: filter by tag
8888- platform: filter by platform (leaflet, pckt, offprint, greengale, other)
8989- since: ISO date - only documents created after this date
9090- limit: max results (default 5, max 40)
9191-9292- returns:
9393- list of results with uri, title, snippet, platform, and web url
9494- """
9595- if not query and not tag:
9696- return []
9797-9898- params: dict[str, Any] = {}
9999- if query:
100100- params["q"] = query
101101- if tag:
102102- params["tag"] = tag
103103- if platform:
104104- params["platform"] = platform
105105- if since:
106106- params["since"] = since
107107-108108- async with get_http_client() as client:
109109- response = await client.get("/search", params=params)
110110- response.raise_for_status()
111111- results = response.json()
112112-113113- return [SearchResult(**r) for r in results[:limit]]
114114-115115-116116-@mcp.tool
117117-async def get_document(uri: str) -> Document:
118118- """get the full content of a document by its AT-URI.
119119-120120- fetches the complete document from ATProto, including full text content.
121121- use this after finding documents via search to get the complete text.
122122-123123- args:
124124- uri: the AT-URI of the document (e.g., at://did:plc:.../pub.leaflet.document/...)
125125-126126- returns:
127127- document with full content, title, tags, and metadata
128128- """
129129- # use pdsx to fetch the actual record from ATProto
130130- try:
131131- from pdsx._internal.operations import get_record
132132- from pdsx.mcp.client import get_atproto_client
133133- except ImportError as e:
134134- raise RuntimeError(
135135- "pdsx is required for fetching full documents. install with: uv add pdsx"
136136- ) from e
137137-138138- # extract repo from URI for PDS discovery
139139- # at://did:plc:xxx/collection/rkey
140140- parts = uri.replace("at://", "").split("/")
141141- if len(parts) < 3:
142142- raise ValueError(f"invalid AT-URI: {uri}")
143143-144144- repo = parts[0]
145145-146146- async with get_atproto_client(target_repo=repo) as client:
147147- record = await get_record(client, uri)
148148-149149- value = record.value
150150- # DotDict doesn't have a working .get(), convert to dict first
151151- if hasattr(value, "to_dict") and callable(value.to_dict):
152152- value = value.to_dict()
153153- elif not isinstance(value, dict):
154154- value = dict(value)
155155-156156- # extract content from leaflet's block structure
157157- # pages[].blocks[].block.plaintext
158158- content_parts = []
159159- for page in value.get("pages", []):
160160- for block_wrapper in page.get("blocks", []):
161161- block = block_wrapper.get("block", {})
162162- plaintext = block.get("plaintext", "")
163163- if plaintext:
164164- content_parts.append(plaintext)
165165-166166- content = "\n\n".join(content_parts)
167167-168168- return Document(
169169- uri=record.uri,
170170- title=value.get("title", ""),
171171- content=content,
172172- createdAt=value.get("publishedAt", "") or value.get("createdAt", ""),
173173- tags=value.get("tags", []),
174174- publicationUri=value.get("publication", ""),
175175- )
176176-177177-178178-@mcp.tool
179179-async def find_similar(uri: str, limit: int = 5) -> list[SearchResult]:
180180- """find documents similar to a given document.
181181-182182- uses vector similarity (voyage-3-lite embeddings) to find semantically
183183- related documents. great for discovering related content after finding
184184- an interesting document.
185185-186186- args:
187187- uri: the AT-URI of the document to find similar content for
188188- limit: max similar documents to return (default 5)
189189-190190- returns:
191191- list of similar documents with uri, title, and metadata
192192- """
193193- async with get_http_client() as client:
194194- response = await client.get("/similar", params={"uri": uri})
195195- response.raise_for_status()
196196- results = response.json()
197197-198198- return [SearchResult(**r) for r in results[:limit]]
199199-200200-201201-@mcp.tool
202202-async def get_tags() -> list[Tag]:
203203- """list all available tags with document counts.
204204-205205- returns tags sorted by document count (most popular first).
206206- useful for discovering topics and filtering searches.
207207-208208- returns:
209209- list of tags with their document counts
210210- """
211211- async with get_http_client() as client:
212212- response = await client.get("/tags")
213213- response.raise_for_status()
214214- results = response.json()
215215-216216- return [Tag(**t) for t in results]
217217-218218-219219-@mcp.tool
220220-async def get_stats() -> Stats:
221221- """get index statistics.
222222-223223- returns:
224224- document and publication counts
225225- """
226226- async with get_http_client() as client:
227227- response = await client.get("/stats")
228228- response.raise_for_status()
229229- return Stats(**response.json())
230230-231231-232232-@mcp.tool
233233-async def get_popular(limit: int = 5) -> list[PopularSearch]:
234234- """get popular search queries.
235235-236236- see what others are searching for.
237237- can inspire new research directions.
238238-239239- args:
240240- limit: max queries to return (default 5)
241241-242242- returns:
243243- list of popular queries with search counts
244244- """
245245- async with get_http_client() as client:
246246- response = await client.get("/popular")
247247- response.raise_for_status()
248248- results = response.json()
249249-250250- return [PopularSearch(**p) for p in results[:limit]]
251251-252252-253253-# -----------------------------------------------------------------------------
254254-# resources
255255-# -----------------------------------------------------------------------------
256256-257257-258258-@mcp.resource("pub-search://stats")
259259-async def stats_resource() -> str:
260260- """current index statistics."""
261261- stats = await get_stats()
262262- return f"pub search index: {stats.documents} documents, {stats.publications} publications"
263263-264264-265265-# -----------------------------------------------------------------------------
266266-# entrypoint
267267-# -----------------------------------------------------------------------------
268268-269269-270270-def main() -> None:
271271- """run the MCP server."""
272272- mcp.run()
273273-274274-275275-if __name__ == "__main__":
276276- main()
+9-12
mcp/tests/test_mcp.py
···11-"""tests for pub-search MCP server."""
11+"""tests for leaflet MCP server."""
2233import pytest
44from mcp.types import TextContent
···66from fastmcp.client import Client
77from fastmcp.client.transports import FastMCPTransport
8899-from pub_search._types import Document, PopularSearch, SearchResult, Stats, Tag
1010-from pub_search.server import mcp
99+from leaflet_mcp._types import Document, PopularSearch, SearchResult, Stats, Tag
1010+from leaflet_mcp.server import mcp
111112121313class TestTypes:
···2323 snippet="this is a test...",
2424 createdAt="2025-01-01T00:00:00Z",
2525 rkey="123",
2626- basePath="gyst.leaflet.pub",
2727- platform="leaflet",
2626+ basePath="/blog",
2827 )
2928 assert r.type == "article"
3029 assert r.uri == "at://did:plc:abc/pub.leaflet.document/123"
3130 assert r.title == "test article"
3232- assert r.platform == "leaflet"
3333- assert r.url == "https://gyst.leaflet.pub/123"
34313532 def test_search_result_looseleaf(self):
3633 """SearchResult supports looseleaf type."""
···96939794 def test_mcp_server_imports(self):
9895 """mcp server can be imported without errors."""
9999- from pub_search import mcp
9696+ from leaflet_mcp import mcp
10097101101- assert mcp.name == "pub-search"
9898+ assert mcp.name == "leaflet"
10299103100 def test_exports(self):
104101 """all expected exports are available."""
105105- from pub_search import main, mcp
102102+ from leaflet_mcp import main, mcp
106103107104 assert mcp is not None
108105 assert main is not None
···141138 resources = await client.list_resources()
142139143140 resource_uris = {str(r.uri) for r in resources}
144144- assert "pub-search://stats" in resource_uris
141141+ assert "leaflet://stats" in resource_uris
145142146143 async def test_usage_guide_prompt_content(self, client):
147144 """usage_guide prompt returns helpful content."""
···151148 assert len(result.messages) > 0
152149 content = result.messages[0].content
153150 assert isinstance(content, TextContent)
154154- assert "pub-search" in content.text
151151+ assert "Leaflet" in content.text
155152 assert "search" in content.text
156153157154 async def test_search_tips_prompt_content(self, client):