···
-# Twisted
+# Twisted Monorepo
 
-A mobile client for [Tangled](https://tangled.org).
+- `apps/twisted`: Ionic/Vue client
+- `packages/api`: Go API copied from `~/Projects/TWISTER`
 
 ## Development
 
-Run the mobile apps with Capacitor:
+Use the top-level `justfile` for common tasks:
 
 ```bash
-pnpm cap run ios
-pnpm cap run android
+just dev
+just build
+just test
 ```
 
-Or to test the web version:
-
-```bash
-pnpm dev
-```
+The existing client package still works directly from `apps/twisted`.
···
# Twisted

A mobile client for [Tangled](https://tangled.org).

## Development

Run the mobile apps with Capacitor:

```bash
pnpm cap run ios
pnpm cap run android
```

Or to test the web version:

```bash
pnpm dev
```
···
# If you prefer the allow list template instead of the deny list, see community template:
# https://github.com/github/gitignore/blob/main/community/Golang/Go.AllowList.gitignore
#
# Binaries for programs and plugins
twister
*.exe
*.exe~
*.dll
*.so
*.dylib

# Test binary, built with `go test -c`
*.test

# Code coverage profiles and other test artifacts
*.out
coverage.*
*.coverprofile
profile.cov

# Dependency directories (remove the comment below to include it)
# vendor/

# Go workspace file
go.work
go.work.sum

# env file
.env

# Editor/IDE
# .idea/
# .vscode/
···
# Twister

Tap-based search engine for Tangled.
packages/api/docs/specs/01-architecture.md (+292)
---
title: "Spec 01 — Architecture"
updated: 2026-03-22
---

## 1. Purpose

Build a Go-based search service for Tangled content on AT Protocol that:

* ingests Tangled records through **Tap** (already deployed on Railway)
* denormalizes them into internal search documents
* indexes them in **Turso/libSQL**
* exposes a search API with **keyword**, **semantic**, and **hybrid** retrieval modes

## 2. Functional Goals

The system shall:

* index Tangled-specific ATProto collections under the `sh.tangled.*` namespace
* support initial backfill and continuous incremental sync via Tap
* support lexical retrieval using Turso's Tantivy-backed FTS
* support semantic retrieval using vector embeddings
* support hybrid ranking combining lexical and semantic signals
* expose stable HTTP APIs for search and document lookup
* support deployment on **Railway**

## 3. Non-Functional Goals

The system shall prioritize:

* **correctness of sync** — cursors never advance ahead of committed data
* **operational simplicity** — single binary, subcommand-driven
* **incremental delivery** — keyword search ships before embeddings
* **small deployable services** — process groups, not microservices
* **reindexability** — any document or collection can be re-normalized and re-indexed
* **low coupling** — sync, indexing, and serving are independent concerns

## 4. Out of Scope (v1)

* code-aware symbol search
* Sourcegraph-style structural search
* personalized ranking
* access control beyond public/private visibility flags in indexed records
* full analytics pipeline
* custom ANN infrastructure outside Turso/libSQL

## 5. Design Principles

1. **Tap owns synchronization correctness.** The application does not consume the raw firehose. Tap handles connection, cryptographic verification, backfill, and filtering.

2. **The indexer owns denormalization.** Raw ATProto records are never queried directly by the public API.

3. **Search serves denormalized documents.** Search ranking depends on the document model, not transport.

4. **Keyword search is the baseline.** Semantic and hybrid search are layered on top.

5. **Embeddings are asynchronous.** Ingestion is never blocked on vector generation unless explicitly configured.

## 6. External Systems

- **AT Protocol network** — source of all Tangled content
- **Tap** — filtered event delivery from the AT Protocol firehose (deployed on Railway)
- **Turso/libSQL** — relational storage, Tantivy-backed FTS, and native vector search
- **Embedding provider** — generates vectors for semantic search
- **Railway** — deployment platform for Twister services and Tap

## 7. Architecture Summary

```text
ATProto Firehose / PDS
        │
        ▼
   Tap (Railway)
        │  WebSocket / webhook JSON events
        ▼
 Go Indexer Service
   ├─ decode Tap events
   ├─ normalize records → documents
   ├─ upsert documents
   ├─ schedule embeddings
   └─ persist sync cursor
        │
        ▼
   Turso/libSQL
   ├─ documents table
   ├─ document_embeddings table
   ├─ FTS index (Tantivy-backed)
   ├─ vector index (DiskANN)
   └─ sync_state table
        │
        ▼
  Go Search API
   ├─ keyword search (fts_match / fts_score)
   ├─ semantic search (vector_top_k)
   ├─ hybrid search (weighted merge)
   └─ document fetch
```
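The hybrid search stage in the diagram merges lexical and semantic results. A minimal sketch of a weighted merge, assuming both score sets have already been normalized to a comparable range — the `alpha` weight and the function itself are illustrative, not part of the spec:

```go
package main

import "sort"

// hybridMerge combines normalized lexical (BM25) and semantic (cosine)
// scores per document ID. alpha weights the lexical side; a document
// found by only one retriever contributes zero for the other.
func hybridMerge(lexical, semantic map[string]float64, alpha float64, k int) []string {
	combined := map[string]float64{}
	for id, s := range lexical {
		combined[id] += alpha * s
	}
	for id, s := range semantic {
		combined[id] += (1 - alpha) * s
	}
	ids := make([]string, 0, len(combined))
	for id := range combined {
		ids = append(ids, id)
	}
	// Highest combined score first; ties broken by ID for determinism.
	sort.Slice(ids, func(i, j int) bool {
		if combined[ids[i]] != combined[ids[j]] {
			return combined[ids[i]] > combined[ids[j]]
		}
		return ids[i] < ids[j]
	})
	if len(ids) > k {
		ids = ids[:k]
	}
	return ids
}
```

Other merge schemes (e.g., reciprocal rank fusion) would slot into the same place in the pipeline.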
## 8. Runtime Units

| Unit           | Role                                | Deployment                 |
| -------------- | ----------------------------------- | -------------------------- |
| `api`          | HTTP search and document API        | Railway service (public)   |
| `indexer`      | Tap consumer, normalizer, DB writer | Railway service (internal) |
| `embed-worker` | Async embedding generation          | Optional Railway service   |
| `tap`          | ATProto sync                        | Railway (already deployed) |

## 9. Repository Structure

```text
main.go

internal/
  api/           # HTTP handlers, middleware, routes
  config/        # Config struct, env parsing
  embed/         # Embedding provider abstraction, worker
  index/         # FTS and vector index management
  ingest/        # Tap event consumer, ingestion loop
  normalize/     # Per-collection record → document adapters
  observability/ # Structured logging, metrics
  ranking/       # Score normalization, hybrid merge
  search/        # Search orchestration (keyword, semantic, hybrid)
  store/         # DB access layer, migrations, domain types
  tapclient/     # Tap WebSocket/webhook client
```

## 10. Binary Subcommands

```bash
twister api          # Start HTTP search API
twister indexer      # Start Tap consumer / indexer
twister embed-worker # Start async embedding worker
twister reindex      # Re-normalize and upsert documents
twister reembed      # Re-generate embeddings
twister backfill     # Bootstrap index from seed users
twister healthcheck  # One-shot health probe
```
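One way to wire these subcommands is a simple name→function dispatch in `main.go` — a hedged sketch only; the run functions below are placeholders, and the actual command wiring is not fixed by this spec:

```go
package main

import "fmt"

// commands maps each twister subcommand to its entry point.
// The bodies are stubs standing in for the real services.
var commands = map[string]func() error{
	"api":          func() error { return nil }, // HTTP search API
	"indexer":      func() error { return nil }, // Tap consumer / indexer
	"embed-worker": func() error { return nil }, // async embedding worker
	"healthcheck":  func() error { return nil }, // one-shot probe
}

// dispatch looks up and runs the named subcommand,
// returning an error for unknown names.
func dispatch(name string) error {
	cmd, ok := commands[name]
	if !ok {
		return fmt.Errorf("unknown subcommand %q", name)
	}
	return cmd()
}
```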
## 11. Technology Choices

### Language: Go

Go is the implementation language for the API server, indexer, embedding worker, and CLI commands. Rationale: straightforward long-running services, excellent HTTP support, good concurrency model, small container footprint.

### Sync Layer: Tap

Tap is the only supported sync source in v1. It handles firehose connection, cryptographic verification, backfill, and filtering, then delivers simple JSON events via WebSocket or webhook.

**Tap is already deployed on Railway.** Twister connects to it as a WebSocket client.

#### Tap Capabilities

- Validates repo structure, MST integrity, and identity signatures
- Automatic backfill fetches full repo history from the PDS when repos are added
- Filtered output by DID list, collection, or full network mode
- Ordering guarantees: historical events (`live: false`) delivered before live events (`live: true`)

#### Tap Delivery Modes

| Mode                       | Config                  | Behavior                                          |
| -------------------------- | ----------------------- | ------------------------------------------------- |
| WebSocket + acks (default) | —                       | Client acks each event; no data loss              |
| Fire-and-forget            | `TAP_DISABLE_ACKS=true` | Events marked acked on receipt; simpler but lossy |
| Webhook                    | `TAP_WEBHOOK_URL=...`   | Events POSTed as JSON; acked on HTTP 200          |

#### Tap API Endpoints (reference)

| Endpoint              | Method | Purpose                               |
| --------------------- | ------ | ------------------------------------- |
| `/health`             | GET    | Health check                          |
| `/channel`            | WS     | WebSocket event stream                |
| `/repos/add`          | POST   | Add DIDs to track                     |
| `/repos/remove`       | POST   | Stop tracking a repo                  |
| `/info/:did`          | GET    | Repo state, rev, record count, errors |
| `/stats/repo-count`   | GET    | Total tracked repos                   |
| `/stats/record-count` | GET    | Total tracked records                 |
| `/stats/cursors`      | GET    | Firehose and list-repos cursors       |

#### Key Tap Configuration

| Variable                 | Default | Purpose                                                                            |
| ------------------------ | ------- | ---------------------------------------------------------------------------------- |
| `TAP_SIGNAL_COLLECTION`  | —       | Auto-track repos with records in this collection                                   |
| `TAP_COLLECTION_FILTERS` | —       | Comma-separated collection filters (e.g., `sh.tangled.repo,sh.tangled.repo.issue`) |
| `TAP_ADMIN_PASSWORD`     | —       | Basic auth for API access                                                          |
| `TAP_DISABLE_ACKS`       | `false` | Fire-and-forget mode                                                               |
| `TAP_WEBHOOK_URL`        | —       | Webhook delivery URL                                                               |

### Storage and Search: Turso/libSQL

Turso/libSQL is used for relational metadata storage, Tantivy-backed full-text search, and native vector search.

#### Go SDK Options

| Package                                            | CGo | Embedded Replicas | Remote |
| -------------------------------------------------- | --- | ----------------- | ------ |
| `github.com/tursodatabase/go-libsql`               | Yes | Yes               | Yes    |
| `github.com/tursodatabase/libsql-client-go/libsql` | No  | No                | Yes    |

Both register as `database/sql` drivers under `"libsql"`. They cannot be imported in the same binary.

**Recommendation:** Use `libsql-client-go` (pure Go, remote-only) unless embedded replicas are needed for local read performance.

#### Connection Patterns

```go
// Remote only (pure Go, no CGo)
import _ "github.com/tursodatabase/libsql-client-go/libsql"

db, err := sql.Open("libsql", "libsql://your-db.turso.io?authToken=TOKEN")

// Embedded replica (CGo required)
import "github.com/tursodatabase/go-libsql"

connector, err := libsql.NewEmbeddedReplicaConnector(
    "local.db", "libsql://your-db.turso.io",
    libsql.WithAuthToken("TOKEN"),
    libsql.WithSyncInterval(time.Minute),
)
db := sql.OpenDB(connector)
```
#### Full-Text Search (Tantivy-backed)

Turso FTS is **not** standard SQLite FTS5. It uses Tantivy under the hood.

```sql
-- Create FTS index with per-column tokenizers and weights
CREATE INDEX idx_docs_fts ON documents USING fts (
    title WITH tokenizer=default,
    body WITH tokenizer=default,
    summary WITH tokenizer=default,
    repo_name WITH tokenizer=simple,
    author_handle WITH tokenizer=raw
) WITH (weights='title=3.0,repo_name=2.5,author_handle=2.0,summary=1.5,body=1.0');

-- Filter by match
SELECT id, title FROM documents
WHERE fts_match(title, body, summary, repo_name, author_handle, 'search query');

-- BM25 scoring
SELECT id, title, fts_score(title, body, summary, repo_name, author_handle, 'search query') AS score
FROM documents
ORDER BY score DESC;

-- Highlighting
SELECT fts_highlight(title, '<b>', '</b>', 'search query') AS highlighted
FROM documents;
```

**Available tokenizers:** `default` (Unicode-aware), `raw` (exact match), `simple` (whitespace + punctuation), `whitespace`, `ngram` (2–3 char n-grams).

**Query syntax (Tantivy):** `database AND search`, `database NOT nosql`, `"exact phrase"`, `data*` (prefix), `title:database` (field-specific), `title:database^2` (boosting).

**Limitations:** No snippet function (use highlighting). No automatic segment merging (manual `OPTIMIZE INDEX` required). No read-your-writes within a transaction. No MATCH operator (use the `fts_match()` function).

#### Vector Search

```sql
-- Vector column type
embedding F32_BLOB(768)

-- Insert
INSERT INTO document_embeddings (document_id, embedding, ...)
VALUES (?, vector32(?), ...); -- ? is a JSON array string '[0.1, 0.2, ...]'

-- Brute-force similarity search
SELECT d.id, vector_distance_cos(e.embedding, vector32(?)) AS distance
FROM documents d
JOIN document_embeddings e ON d.id = e.document_id
ORDER BY distance ASC LIMIT 20;

-- Create ANN index (DiskANN)
CREATE INDEX idx_embeddings ON document_embeddings(
    libsql_vector_idx(embedding, 'metric=cosine')
);

-- ANN search via index
SELECT d.id, d.title
FROM vector_top_k('idx_embeddings', vector32(?), 20) AS v
JOIN document_embeddings e ON e.rowid = v.id
JOIN documents d ON d.id = e.document_id;
```

**Vector types:** `F32_BLOB` (recommended), `F16_BLOB`, `F64_BLOB`, `F8_BLOB`, `F1BIT_BLOB`.

**Distance functions:** `vector_distance_cos` (cosine), `vector_distance_l2` (Euclidean).

**Max dimensions:** 65,536. The dimension is fixed at table creation.
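As the insert example above notes, the parameter bound to `vector32(?)` is a JSON array string. A small sketch of producing that string from a Go `[]float32` — the helper name is ours, not part of either libSQL SDK:

```go
package main

import "encoding/json"

// vectorJSON renders an embedding as the JSON array string that
// Turso's vector32(?) conversion function expects as its argument.
func vectorJSON(v []float32) (string, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

The result is then passed as an ordinary `database/sql` parameter, e.g. `db.QueryContext(ctx, query, s)`.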
### Deployment: Railway

Railway is the deployment platform. It supports health checks, autodeploy, per-service scaling, and internal networking. Tap is already deployed here. Twister deploys as separate Railway services (api, indexer, embed-worker) within the same project.
packages/api/docs/specs/02-tangled-lexicons.md (+192)
---
title: "Spec 02 — Tangled Lexicons"
updated: 2026-03-22
source: https://github.com/mary-ext/atcute/tree/trunk/packages/definitions/tangled/lexicons/sh/tangled
---

All Tangled records use the `sh.tangled.*` namespace. Records use TID keys unless noted otherwise.

## 1. Searchable Record Types

These are the primary records Twister indexes for search.

### sh.tangled.repo

Repository metadata. Key: `tid`.

| Field         | Type     | Required | Description                                    |
| ------------- | -------- | -------- | ---------------------------------------------- |
| `name`        | string   | yes      | Repository name                                |
| `knot`        | string   | yes      | Knot (hosting node) where the repo was created |
| `spindle`     | string   | no       | CI runner for jobs                             |
| `description` | string   | no       | 1–140 graphemes                                |
| `website`     | uri      | no       | Related URI                                    |
| `topics`      | string[] | no       | Up to 50 topic tags, each 1–50 chars           |
| `source`      | uri      | no       | Upstream source                                |
| `labels`      | at-uri[] | no       | Label definitions this repo subscribes to      |
| `createdAt`   | datetime | yes      |                                                |
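For illustration, the `sh.tangled.repo` record above could be decoded into a Go struct like this. The struct and its JSON tags are our sketch derived from the field table, not a published SDK type:

```go
package main

import "encoding/json"

// RepoRecord mirrors the sh.tangled.repo lexicon fields listed above.
// Optional fields are omitempty; at-uris and uris are kept as strings.
type RepoRecord struct {
	Type        string   `json:"$type"`
	Name        string   `json:"name"`
	Knot        string   `json:"knot"`
	Spindle     string   `json:"spindle,omitempty"`
	Description string   `json:"description,omitempty"`
	Website     string   `json:"website,omitempty"`
	Topics      []string `json:"topics,omitempty"`
	Source      string   `json:"source,omitempty"`
	Labels      []string `json:"labels,omitempty"` // at-uris
	CreatedAt   string   `json:"createdAt"`
}

// decodeRepoRecord parses a raw record payload (e.g., from a Tap event).
func decodeRepoRecord(raw []byte) (RepoRecord, error) {
	var r RepoRecord
	err := json.Unmarshal(raw, &r)
	return r, err
}
```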
### sh.tangled.repo.issue

Issue on a repository. Key: `tid`.

| Field        | Type     | Required | Description                      |
| ------------ | -------- | -------- | -------------------------------- |
| `repo`       | at-uri   | yes      | AT-URI of the parent repo record |
| `title`      | string   | yes      | Issue title                      |
| `body`       | string   | no       | Issue body (markdown)            |
| `createdAt`  | datetime | yes      |                                  |
| `mentions`   | did[]    | no       | Mentioned users                  |
| `references` | at-uri[] | no       | Referenced records               |

### sh.tangled.repo.pull

Pull request. Key: `tid`.

| Field        | Type     | Required | Description                                        |
| ------------ | -------- | -------- | -------------------------------------------------- |
| `target`     | object   | yes      | `{repo: at-uri, branch: string}`                   |
| `title`      | string   | yes      | PR title                                           |
| `body`       | string   | no       | PR description (markdown)                          |
| `patchBlob`  | blob     | yes      | Patch content (`text/x-patch`)                     |
| `source`     | object   | no       | `{branch: string, sha: string(40), repo?: at-uri}` |
| `createdAt`  | datetime | yes      |                                                    |
| `mentions`   | did[]    | no       | Mentioned users                                    |
| `references` | at-uri[] | no       | Referenced records                                 |

### sh.tangled.string

Code snippet / gist. Key: `tid`.

| Field         | Type     | Required | Description         |
| ------------- | -------- | -------- | ------------------- |
| `filename`    | string   | yes      | 1–140 graphemes     |
| `description` | string   | yes      | Up to 280 graphemes |
| `createdAt`   | datetime | yes      |                     |
| `contents`    | string   | yes      | Snippet content     |

### sh.tangled.actor.profile

User profile. Key: `literal:self` (singleton per account).

| Field                | Type     | Required | Description                  |
| -------------------- | -------- | -------- | ---------------------------- |
| `avatar`             | blob     | no       | PNG/JPEG, max 1MB            |
| `description`        | string   | no       | Bio, up to 256 graphemes     |
| `links`              | uri[]    | no       | Up to 5 social/website links |
| `stats`              | string[] | no       | Up to 2 vanity stat types    |
| `bluesky`            | boolean  | yes      | Show Bluesky link            |
| `location`           | string   | no       | Up to 40 graphemes           |
| `pinnedRepositories` | at-uri[] | no       | Up to 6 pinned repos         |
| `pronouns`           | string   | no       | Up to 40 chars               |

## 2. Interaction Record Types

These records represent social interactions. They may be indexed for counts/signals but are lower priority for text search.

### sh.tangled.feed.star

Star/favorite on a record. Key: `tid`.

| Field       | Type     | Required |
| ----------- | -------- | -------- |
| `subject`   | at-uri   | yes      |
| `createdAt` | datetime | yes      |

### sh.tangled.feed.reaction

Emoji reaction on a record. Key: `tid`.

| Field       | Type     | Required | Description                     |
| ----------- | -------- | -------- | ------------------------------- |
| `subject`   | at-uri   | yes      |                                 |
| `reaction`  | string   | yes      | One of: 👍 👎 😆 🎉 🫤 ❤️ 🚀 👀 |
| `createdAt` | datetime | yes      |                                 |

### sh.tangled.graph.follow

Follow a user. Key: `tid`.

| Field       | Type     | Required |
| ----------- | -------- | -------- |
| `subject`   | did      | yes      |
| `createdAt` | datetime | yes      |

## 3. State Record Types

These records track mutable state of issues and PRs.

### sh.tangled.repo.issue.state

| Field   | Type   | Required | Description                                                                |
| ------- | ------ | -------- | -------------------------------------------------------------------------- |
| `issue` | at-uri | yes      |                                                                            |
| `state` | string | yes      | `sh.tangled.repo.issue.state.open` or `sh.tangled.repo.issue.state.closed` |

### sh.tangled.repo.pull.status

| Field    | Type   | Required | Description                                                 |
| -------- | ------ | -------- | ----------------------------------------------------------- |
| `pull`   | at-uri | yes      |                                                             |
| `status` | string | yes      | `sh.tangled.repo.pull.status.open`, `.closed`, or `.merged` |

## 4. Comment Record Types

### sh.tangled.repo.issue.comment

| Field        | Type     | Required | Description                    |
| ------------ | -------- | -------- | ------------------------------ |
| `issue`      | at-uri   | yes      | Parent issue                   |
| `body`       | string   | yes      | Comment body                   |
| `createdAt`  | datetime | yes      |                                |
| `replyTo`    | at-uri   | no       | Parent comment (for threading) |
| `mentions`   | did[]    | no       |                                |
| `references` | at-uri[] | no       |                                |

### sh.tangled.repo.pull.comment

| Field        | Type     | Required | Description  |
| ------------ | -------- | -------- | ------------ |
| `pull`       | at-uri   | yes      | Parent PR    |
| `body`       | string   | yes      | Comment body |
| `createdAt`  | datetime | yes      |              |
| `mentions`   | did[]    | no       |              |
| `references` | at-uri[] | no       |              |

## 5. Infrastructure Record Types

These are not indexed for search but may be consumed for operational context.

| Collection                    | Description                                          |
| ----------------------------- | ---------------------------------------------------- |
| `sh.tangled.label.definition` | Label definitions with name, valueType, scope, color |
| `sh.tangled.label.op`         | Label application operations                         |
| `sh.tangled.git.refUpdate`    | Git reference update events                          |
| `sh.tangled.knot.member`      | Knot membership                                      |
| `sh.tangled.spindle.member`   | Spindle (CI runner) membership                       |
| `sh.tangled.pipeline.status`  | CI pipeline status                                   |

## 6. Collection Priority for v1 Indexing

| Priority | Collection                      | Rationale                            |
| -------- | ------------------------------- | ------------------------------------ |
| P0       | `sh.tangled.repo`               | Core searchable content              |
| P0       | `sh.tangled.repo.issue`         | High-signal text content             |
| P0       | `sh.tangled.repo.pull`          | High-signal text content             |
| P1       | `sh.tangled.string`             | Searchable code snippets             |
| P1       | `sh.tangled.actor.profile`      | User/org discovery                   |
| P2       | `sh.tangled.repo.issue.comment` | Body text, high volume               |
| P2       | `sh.tangled.repo.pull.comment`  | Body text, high volume               |
| P2       | `sh.tangled.repo.issue.state`   | State for filtering, not text search |
| P2       | `sh.tangled.repo.pull.status`   | State for filtering, not text search |
| P3       | `sh.tangled.feed.star`          | Ranking signal (star count)          |
| P3       | `sh.tangled.feed.reaction`      | Ranking signal                       |
| P3       | `sh.tangled.graph.follow`       | Ranking signal                       |

### Tap Collection Filter for v1

```sh
TAP_COLLECTION_FILTERS=sh.tangled.repo,sh.tangled.repo.issue,sh.tangled.repo.issue.comment,sh.tangled.repo.issue.state,sh.tangled.repo.pull,sh.tangled.repo.pull.comment,sh.tangled.repo.pull.status,sh.tangled.string,sh.tangled.actor.profile,sh.tangled.feed.star

# or sh.tangled.*
```
packages/api/docs/specs/03-data-model.md (+153)
---
title: "Spec 03 — Data Model"
updated: 2026-03-22
---

## 1. Search Document

A **search document** is the internal denormalized representation used for retrieval. It is derived from one or more ATProto records via normalization.

### Stable Identifier

```sh
id = did + "|" + collection + "|" + rkey
```

Example: `did:plc:abc123|sh.tangled.repo|3kb3fge5lm32x`
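A sketch of constructing the identifier in Go (the helper name is ours):

```go
package main

import "fmt"

// docID builds the stable composite document identifier:
// id = did + "|" + collection + "|" + rkey.
func docID(did, collection, rkey string) string {
	return fmt.Sprintf("%s|%s|%s", did, collection, rkey)
}
```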
### Required Fields

| Field           | Type    | Description                                                                |
| --------------- | ------- | -------------------------------------------------------------------------- |
| `id`            | TEXT PK | Stable composite identifier                                                |
| `did`           | TEXT    | Author DID                                                                 |
| `collection`    | TEXT    | ATProto collection NSID                                                    |
| `rkey`          | TEXT    | Record key (TID)                                                           |
| `at_uri`        | TEXT    | Full AT-URI                                                                |
| `cid`           | TEXT    | Content identifier (hash)                                                  |
| `record_type`   | TEXT    | Normalized type label (e.g., `repo`, `issue`, `pull`, `string`, `profile`) |
| `title`         | TEXT    | Normalized title                                                           |
| `body`          | TEXT    | Normalized body text                                                       |
| `summary`       | TEXT    | Short summary / description                                                |
| `repo_did`      | TEXT    | DID of the repo owner (resolved from at-uri for issues/PRs)                |
| `repo_name`     | TEXT    | Repository name (resolved)                                                 |
| `author_handle` | TEXT    | Author handle (resolved via identity)                                      |
| `tags_json`     | TEXT    | JSON array of tags/topics                                                  |
| `language`      | TEXT    | Detected or declared language                                              |
| `created_at`    | TEXT    | Record creation timestamp (ISO 8601)                                       |
| `updated_at`    | TEXT    | Last record update timestamp                                               |
| `indexed_at`    | TEXT    | When this document was last indexed                                        |
| `deleted_at`    | TEXT    | Soft-delete timestamp (tombstone)                                          |

### Derived Fields (not stored in documents table)

| Field            | Location                               | Description                    |
| ---------------- | -------------------------------------- | ------------------------------ |
| Embedding vector | `document_embeddings` table            | F32_BLOB(N)                    |
| FTS index        | Turso FTS index                        | Tantivy-backed full-text index |
| Star count       | Aggregated from `sh.tangled.feed.star` | Ranking signal                 |

## 2. Core Documents Table

```sql
CREATE TABLE documents (
    id            TEXT PRIMARY KEY,
    did           TEXT NOT NULL,
    collection    TEXT NOT NULL,
    rkey          TEXT NOT NULL,
    at_uri        TEXT NOT NULL,
    cid           TEXT NOT NULL,
    record_type   TEXT NOT NULL,
    title         TEXT,
    body          TEXT,
    summary       TEXT,
    repo_did      TEXT,
    repo_name     TEXT,
    author_handle TEXT,
    tags_json     TEXT,
    language      TEXT,
    created_at    TEXT,
    updated_at    TEXT,
    indexed_at    TEXT NOT NULL,
    deleted_at    TEXT
);

CREATE INDEX idx_documents_did ON documents(did);
CREATE INDEX idx_documents_collection ON documents(collection);
CREATE INDEX idx_documents_record_type ON documents(record_type);
CREATE INDEX idx_documents_repo_did ON documents(repo_did);
CREATE INDEX idx_documents_created_at ON documents(created_at);
CREATE INDEX idx_documents_deleted_at ON documents(deleted_at);
```

## 3. FTS Index

```sql
CREATE INDEX idx_documents_fts ON documents USING fts (
    title WITH tokenizer=default,
    body WITH tokenizer=default,
    summary WITH tokenizer=default,
    repo_name WITH tokenizer=simple,
    author_handle WITH tokenizer=raw,
    tags_json WITH tokenizer=simple
) WITH (weights='title=3.0,repo_name=2.5,author_handle=2.0,summary=1.5,tags_json=1.2,body=1.0');
```

## 4. Embeddings Table

```sql
CREATE TABLE document_embeddings (
    document_id     TEXT PRIMARY KEY REFERENCES documents(id),
    embedding       F32_BLOB(768),
    embedding_model TEXT NOT NULL,
    embedded_at     TEXT NOT NULL
);

CREATE INDEX idx_embeddings_vec ON document_embeddings(
    libsql_vector_idx(embedding, 'metric=cosine')
);
```

The vector dimension (768) is configurable by model. Changing models requires a new column or table migration.

## 5. Sync State Table

```sql
CREATE TABLE sync_state (
    consumer_name   TEXT PRIMARY KEY,
    cursor          TEXT NOT NULL,
    high_water_mark TEXT,
    updated_at      TEXT NOT NULL
);
```

Stores the Tap event ID that has been successfully committed. On restart, the indexer resumes from this cursor.

## 6. Embedding Jobs Table

```sql
CREATE TABLE embedding_jobs (
    document_id  TEXT PRIMARY KEY REFERENCES documents(id),
    status       TEXT NOT NULL, -- 'pending', 'processing', 'completed', 'failed'
    attempts     INTEGER NOT NULL DEFAULT 0,
    last_error   TEXT,
    scheduled_at TEXT NOT NULL,
    updated_at   TEXT NOT NULL
);

CREATE INDEX idx_embedding_jobs_status ON embedding_jobs(status);
```

## 7. Issue/PR State Cache (optional)

To support filtering search results by issue state or PR status without joining back to the raw records:

```sql
CREATE TABLE record_state (
    subject_uri TEXT PRIMARY KEY, -- at-uri of the issue or PR
    state       TEXT NOT NULL,    -- 'open', 'closed', 'merged'
    updated_at  TEXT NOT NULL
);
```

Updated when `sh.tangled.repo.issue.state` or `sh.tangled.repo.pull.status` events are ingested.
packages/api/docs/specs/04-data-pipeline.md (+360)
---
title: "Spec 04 — Data Pipeline"
updated: 2026-03-22
---

Covers the full data path: Tap event ingestion, record normalization, and failure handling.

## 1. Tap Event Format

### Record Events

```json
{
  "id": 12345,
  "type": "record",
  "record": {
    "live": true,
    "rev": "3kb3fge5lm32x",
    "did": "did:plc:abc123",
    "collection": "sh.tangled.repo",
    "rkey": "3kb3fge5lm32x",
    "action": "create",
    "cid": "bafyreig...",
    "record": {
      "$type": "sh.tangled.repo",
      "name": "my-project",
      "knot": "knot.tangled.org",
      "description": "A cool project",
      "topics": ["go", "search"],
      "createdAt": "2026-03-22T12:00:00.000Z"
    }
  }
}
```

Key fields:

- `id` — monotonic event ID, used as cursor
- `type` — `"record"` or `"identity"`
- `record.live` — `true` for real-time events, `false` for backfill
- `record.action` — `"create"`, `"update"`, or `"delete"`
- `record.did` — author DID
- `record.collection` — ATProto collection NSID
- `record.rkey` — record key
- `record.cid` — content identifier
- `record.record` — the full ATProto record payload (absent on delete)
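The envelope above can be decoded into Go structs along these lines. The types are a sketch derived from the sample JSON; keeping the inner payload as `json.RawMessage` lets collection-specific adapters decode it themselves:

```go
package main

import "encoding/json"

// TapEvent is the outer Tap envelope.
type TapEvent struct {
	ID     int64           `json:"id"`
	Type   string          `json:"type"` // "record" or "identity"
	Record *TapRecordEvent `json:"record,omitempty"`
}

// TapRecordEvent carries one record change.
type TapRecordEvent struct {
	Live       bool            `json:"live"`
	Rev        string          `json:"rev"`
	DID        string          `json:"did"`
	Collection string          `json:"collection"`
	Rkey       string          `json:"rkey"`
	Action     string          `json:"action"` // create | update | delete
	CID        string          `json:"cid"`
	Record     json.RawMessage `json:"record,omitempty"` // absent on delete
}

// decodeTapEvent parses one JSON event from the WebSocket or webhook.
func decodeTapEvent(raw []byte) (TapEvent, error) {
	var e TapEvent
	err := json.Unmarshal(raw, &e)
	return e, err
}
```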
### Identity Events

```json
{
  "id": 12346,
  "type": "identity",
  "identity": {
    "did": "did:plc:abc123",
    "handle": "alice.tangled.org",
    "isActive": true,
    "status": "active"
  }
}
```

Identity events are always delivered for tracked repos, regardless of collection filters.

## 2. WebSocket Protocol

### Connection

Connect to `wss://<tap-host>/channel` (or `ws://` for local dev).

If `TAP_ADMIN_PASSWORD` is set, authenticate with HTTP Basic auth (`admin:<password>`).

### Acknowledgment Protocol

Default mode requires the client to ack each event by sending the event `id` back over the WebSocket. Events are retried after `TAP_RETRY_TIMEOUT` (default 60s) if unacked.

For simpler development, set `TAP_DISABLE_ACKS=true` on Tap for fire-and-forget delivery.

### Ordering Guarantees

Events are ordered **per-repo** (per-DID), not globally:

- **Historical events** (`live: false`) may be sent concurrently within a repo
- **Live events** (`live: true`) are synchronization barriers — all prior events for that repo must complete before a live event is sent
- No ordering guarantee across different repos

Example sequence for one repo: `H1, H2, L1, H3, H4, L2`

- H1 and H2 sent concurrently
- Wait for completion, send L1 alone
- Wait for L1, send H3 and H4 concurrently
- Wait for completion, send L2 alone

### Delivery Guarantee

Events are delivered **at least once**. Duplicates may occur on crashes or ack timeouts. The indexer must handle idempotent upserts.

## 3. Ingestion Contract

For each event, the indexer:

1. Validates `type` is `"record"` (identity events are handled separately)
2. Checks `record.collection` against the allowlist
3. Maps `record.action` to an operation:
   - `create` → upsert document
   - `update` → upsert document
   - `delete` → tombstone document (`deleted_at = now`)
4. Decodes `record.record` into the collection-specific struct
5. Normalizes to internal `Document`
6. Upserts into the documents table
7. Schedules embedding job if eligible
8. Persists cursor (`event.id`) **only after successful DB commit**

### Cursor Persistence Rules

- If DB commit fails → cursor does not advance → event will be retried
- If normalization fails → log error, optionally dead-letter, skip → cursor advances
- If embedding scheduling fails → document remains keyword-searchable → cursor advances
119119+120120+## 4. Backfill Behavior
121121+122122+When a repo is added to Tap (via `/repos/add`, signal collection, or full network mode):
123123+124124+1. Tap fetches full repo history from PDS via `com.atproto.sync.getRepo`
125125+2. Firehose events for that repo are buffered during backfill
126126+3. Historical events (`live: false`) are delivered first
127127+4. After backfill completes, buffered live events drain
128128+5. New firehose events stream normally (`live: true`)
129129+130130+### Application-Level Backfill Support
131131+132132+The indexer also supports:
133133+134134+- Full reindex from existing corpus (re-normalize all stored documents)
135135+- Targeted reindex by collection
136136+- Targeted reindex by DID
137137+138138+These do not involve Tap — they re-process documents already in the database.
139139+140140+## 5. Normalization
141141+142142+Normalization converts heterogeneous `sh.tangled.*` records into the common `Document` shape defined in [03-data-model.md](03-data-model.md).
143143+144144+### Adapter Interface
145145+146146+Each indexed collection provides an adapter:
147147+148148+```go
149149+type RecordAdapter interface {
150150+ Collection() string
151151+ RecordType() string
152152+ Normalize(event TapRecordEvent) (*Document, error)
153153+ Searchable(record map[string]any) bool
154154+}
155155+```
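A hypothetical adapter for `sh.tangled.repo` might look like the following; the `TapRecordEvent` and `Document` shapes are simplified stand-ins for the types in [03-data-model.md](03-data-model.md):

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the spec's types.
type TapRecordEvent struct {
	DID    string
	Record map[string]any
}

type Document struct {
	Title, Body, RepoName, RepoDID string
}

type RecordAdapter interface {
	Collection() string
	RecordType() string
	Normalize(event TapRecordEvent) (*Document, error)
	Searchable(record map[string]any) bool
}

// RepoAdapter normalizes sh.tangled.repo records per the field mapping
// in this spec.
type RepoAdapter struct{}

func (RepoAdapter) Collection() string { return "sh.tangled.repo" }
func (RepoAdapter) RecordType() string { return "repo" }

func (RepoAdapter) Normalize(ev TapRecordEvent) (*Document, error) {
	name, _ := ev.Record["name"].(string)
	if name == "" {
		return nil, errors.New("repo record missing name")
	}
	desc, _ := ev.Record["description"].(string)
	return &Document{Title: name, Body: desc, RepoName: name, RepoDID: ev.DID}, nil
}

func (RepoAdapter) Searchable(record map[string]any) bool {
	name, _ := record["name"].(string)
	return name != "" // searchable unless the name is empty
}

func main() {
	var a RecordAdapter = RepoAdapter{}
	doc, err := a.Normalize(TapRecordEvent{DID: "did:plc:abc",
		Record: map[string]any{"name": "glow-rs", "description": "TUI viewer"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(doc.Title, doc.RepoDID)
}
```

A sketch only — a production adapter would decode `record.record` into a typed struct rather than asserting out of a map.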
156156+157157+### Per-Collection Normalization
158158+159159+#### sh.tangled.repo → `repo`
160160+161161+| Document Field | Source |
162162+| -------------- | -------------------------------- |
163163+| `title` | `record.name` |
164164+| `body` | `record.description` |
165165+| `summary` | `record.description` (truncated) |
166166+| `repo_name` | `record.name` |
167167+| `repo_did` | `event.did` |
168168+| `tags_json` | `json(record.topics)` |
169169+| `created_at` | `record.createdAt` |
170170+171171+**Searchable:** Always, unless `record.name` is empty.
172172+173173+#### sh.tangled.repo.issue → `issue`
174174+175175+| Document Field | Source |
176176+| -------------- | ------------------------------------------- |
177177+| `title` | `record.title` |
178178+| `body` | `record.body` |
179179+| `summary` | First ~200 chars of `record.body` |
180180+| `repo_did` | Extracted from `record.repo` AT-URI |
181181+| `repo_name` | Resolved from repo AT-URI |
182182+| `tags_json` | `[]` (labels resolved separately if needed) |
183183+| `created_at` | `record.createdAt` |
184184+185185+**Searchable:** Always.
186186+187187+#### sh.tangled.repo.pull → `pull`
188188+189189+| Document Field | Source |
190190+| -------------- | ------------------------------------------ |
191191+| `title` | `record.title` |
192192+| `body` | `record.body` |
193193+| `summary` | First ~200 chars of `record.body` |
194194+| `repo_did` | Extracted from `record.target.repo` AT-URI |
195195+| `repo_name` | Resolved from target repo AT-URI |
196196+| `tags_json` | `[]` |
197197+| `created_at` | `record.createdAt` |
198198+199199+**Searchable:** Always.
200200+201201+#### sh.tangled.string → `string`
202202+203203+| Document Field | Source |
204204+| -------------- | -------------------- |
205205+| `title` | `record.filename` |
206206+| `body` | `record.contents` |
207207+| `summary` | `record.description` |
208208+| `repo_name` | — |
209209+| `repo_did` | — |
210210+| `tags_json` | `[]` |
211211+| `created_at` | `record.createdAt` |
212212+213213+**Searchable:** Always (content is required).
214214+215215+#### sh.tangled.actor.profile → `profile`
216216+217217+| Document Field | Source |
218218+| -------------- | ---------------------------------------------------- |
219219+| `title` | Author handle (resolved from DID) |
220220+| `body` | `record.description` |
221221+| `summary` | `record.description` (truncated) + `record.location` |
222222+| `repo_name` | — |
223223+| `repo_did` | — |
224224+| `tags_json` | `[]` |
225225+| `created_at` | — (profiles don't have createdAt) |
226226+227227+**Searchable:** If `description` is non-empty.
228228+229229+#### sh.tangled.repo.issue.comment → `issue_comment`
230230+231231+| Document Field | Source |
232232+| -------------- | ----------------------------------------- |
233233+| `title` | — (derived: "Comment on {issue title}") |
234234+| `body` | `record.body` |
235235+| `summary` | First ~200 chars of `record.body` |
236236+| `repo_did` | Resolved from `record.issue` AT-URI chain |
237237+| `repo_name` | Resolved |
238238+| `created_at` | `record.createdAt` |
239239+240240+**Searchable:** If body is non-empty.
241241+242242+#### sh.tangled.repo.pull.comment → `pull_comment`
243243+244244+Same pattern as issue comments, using `record.pull` instead of `record.issue`.
245245+246246+### State Event Handling
247247+248248+State and status records (`sh.tangled.repo.issue.state`, `sh.tangled.repo.pull.status`) do **not** produce new search documents. Instead, they update the `record_state` cache table (see [03-data-model.md](03-data-model.md)).
249249+250250+### Interaction Event Handling
251251+252252+Stars (`sh.tangled.feed.star`) and reactions (`sh.tangled.feed.reaction`) do not produce search documents. They may be aggregated for ranking signals in later phases.
253253+254254+### Embedding Input Text
255255+256256+For documents eligible for embedding, compose the input as:
257257+258258+```text
259259+{title}\n{repo_name}\n{author_handle}\n{tags}\n{summary}\n{body}
260260+```
261261+262262+Fields are joined with newlines. Empty fields are omitted.
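A minimal sketch of that composition (the function name and signature are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// embeddingInput joins the non-empty fields with newlines, in the
// order shown above; empty fields are omitted entirely.
func embeddingInput(title, repoName, authorHandle, tags, summary, body string) string {
	var parts []string
	for _, f := range []string{title, repoName, authorHandle, tags, summary, body} {
		if f != "" {
			parts = append(parts, f)
		}
	}
	return strings.Join(parts, "\n")
}

func main() {
	fmt.Println(embeddingInput("glow-rs", "glow-rs", "desertthunder.dev",
		"rust tui markdown", "Rust TUI markdown viewer", ""))
}
```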
263263+264264+### Repo Name Resolution
265265+266266+Issues, PRs, and comments reference their parent repo via AT-URI (e.g., `at://did:plc:abc/sh.tangled.repo/tid`). Resolving the repo name requires either:
267267+268268+1. Looking up the repo document in the local `documents` table
269269+2. Caching repo metadata in a lightweight lookup table
270270+271271+Option 1 is preferred for v1. If the repo document hasn't been indexed yet, `repo_name` is left empty and backfilled on the next reindex pass.
272272+273273+## 6. Identity Event Handling
274274+275275+Identity events should be used to maintain an author handle cache:
276276+277277+```text
278278+did → handle mapping
279279+```
280280+281281+When an identity event arrives with a new handle, update `author_handle` on all documents with that DID. This ensures search by handle returns current results.
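One way to apply such an update, sketched against in-memory stand-ins (in the real service this would be two SQL statements: an upsert into the handle cache and an `UPDATE` on `documents`):

```go
package main

import "fmt"

// In-memory stand-ins for the handle cache and documents table.
type doc struct{ DID, AuthorHandle string }

var (
	handles   = map[string]string{}
	documents []doc
)

// applyIdentityEvent refreshes the did → handle mapping, then rewrites
// author_handle on every document owned by that DID so search by
// handle returns current results.
func applyIdentityEvent(did, handle string) (updated int) {
	handles[did] = handle
	for i := range documents {
		if documents[i].DID == did && documents[i].AuthorHandle != handle {
			documents[i].AuthorHandle = handle
			updated++
		}
	}
	return updated
}

func main() {
	documents = append(documents, doc{DID: "did:plc:abc", AuthorHandle: "old.handle"})
	fmt.Println(applyIdentityEvent("did:plc:abc", "new.handle"))
}
```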
282282+283283+## 7. Repo Management
284284+285285+To add repos for tracking, POST to Tap's `/repos/add` endpoint:
286286+287287+```bash
288288+curl -u admin:PASSWORD -X POST https://tap-host/repos/add \
289289+ -H "Content-Type: application/json" \
290290+ -d '{"dids": ["did:plc:abc123", "did:plc:def456"]}'
291291+```
292292+293293+Alternatively, use `TAP_SIGNAL_COLLECTION=sh.tangled.repo` to auto-track any repo that has Tangled repo records.
294294+295295+## 8. Failure Handling
296296+297297+### Ingestion Failures
298298+299299+If Tap event processing fails before DB commit:
300300+301301+- Log the failure with event ID, DID, collection, rkey, and error class
302302+- Retry with exponential backoff (for transient errors like DB timeouts)
303303+- Do **not** advance cursor — the event will be re-delivered by Tap
304304+- After max retries for a persistent error, log and skip (cursor advances)
305305+306306+### Normalization Failures
307307+308308+If a record cannot be normalized:
309309+310310+- Log collection, DID, rkey, CID, and error class
311311+- Do not crash the process
312312+- Skip the event and advance cursor
313313+- Optionally insert into a `dead_letter` table for manual inspection
314314+315315+### Embedding Failures
316316+317317+If embedding generation fails:
318318+319319+- The document remains keyword-searchable
320320+- The embedding job is marked `failed` with `last_error` and incremented `attempts`
321321+- Jobs are retried with exponential backoff up to a max attempt count
322322+- After max attempts, the job enters `dead` state
323323+- The embed-worker exposes failed job count as a metric
324324+325325+### DB Failures
326326+327327+If Turso/libSQL is unreachable:
328328+329329+- **API** returns `503` for search endpoints; `/healthz` still returns 200 (liveness), `/readyz` returns 503
330330+- **Indexer** pauses event processing and retries DB connection with backoff; cursor does not advance
331331+- **Embed-worker** pauses job processing and retries
332332+333333+### Tap Connection Failures
334334+335335+If the WebSocket connection to Tap drops:
336336+337337+- Reconnect with exponential backoff
338338+- Resume from the last persisted cursor
339339+- Log reconnection attempts and success
340340+341341+Tap itself handles firehose reconnection independently — a Tap restart does not require indexer intervention beyond reconnecting the WebSocket.
342342+343343+### Duplicate Event Handling
344344+345345+Tap delivers events **at least once**. Duplicates are handled by:
346346+347347+- Using `id = did|collection|rkey` as the primary key
347347+- Making every write an upsert (`INSERT OR REPLACE` / `ON CONFLICT ... DO UPDATE`)
348348+- Comparing CIDs to distinguish true no-ops (same content) from actual updates
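Under those rules, a duplicate-safe upsert can be sketched in memory (the real implementation writes to Turso with an `ON CONFLICT` clause):

```go
package main

import "fmt"

type stored struct{ CID, Title string }

var table = map[string]stored{} // keyed by id = did|collection|rkey

// upsert is duplicate-safe: the composite key makes redelivery
// idempotent, and comparing CIDs separates a true no-op (same content)
// from an actual update.
func upsert(did, collection, rkey, cid, title string) (changed bool) {
	id := did + "|" + collection + "|" + rkey
	if prev, ok := table[id]; ok && prev.CID == cid {
		return false // redelivered event with identical content
	}
	table[id] = stored{CID: cid, Title: title}
	return true
}

func main() {
	fmt.Println(upsert("did:plc:abc", "sh.tangled.repo", "3k", "cid1", "glow-rs"))
	fmt.Println(upsert("did:plc:abc", "sh.tangled.repo", "3k", "cid1", "glow-rs"))
}
```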
350350+351351+### Startup Recovery
352352+353353+On indexer startup:
354354+355355+1. Read `cursor` from `sync_state` table
356356+2. Connect to Tap WebSocket
357357+3. Tap replays events from the stored cursor position
358358+4. Processing resumes normally
359359+360360+If no cursor exists (first run), Tap delivers all historical events from backfill.
+296
packages/api/docs/specs/05-search.md
···11+---
22+title: "Spec 05 — Search"
33+updated: 2026-03-22
44+---
55+66+Covers all search modes, the API contract, scoring, and filtering.
77+88+## 1. Search Modes
99+1010+| Mode | Backing | Available |
1111+|------|---------|-----------|
1212+| `keyword` | Turso Tantivy-backed FTS | MVP |
1313+| `semantic` | Vector similarity (DiskANN index) | Phase 2 |
1414+| `hybrid` | Weighted merge of keyword + semantic | Phase 3 |
1515+1616+## 2. Keyword Search
1717+1818+### Implementation
1919+2020+Uses Turso's `fts_score()` function for BM25 ranking:
2121+2222+```sql
2323+SELECT
2424+ d.id, d.title, d.summary, d.repo_name, d.author_handle,
2525+ d.collection, d.record_type, d.updated_at,
2626+ fts_score(d.title, d.body, d.summary, d.repo_name, d.author_handle, d.tags_json, ?) AS score
2727+FROM documents d
2828+WHERE fts_match(d.title, d.body, d.summary, d.repo_name, d.author_handle, d.tags_json, ?)
2929+ AND d.deleted_at IS NULL
3030+ORDER BY score DESC
3131+LIMIT ? OFFSET ?;
3232+```
3333+3434+### Field Weights
3535+3636+Configured in the FTS index definition:
3737+3838+| Field | Weight | Rationale |
3939+|-------|--------|-----------|
4040+| `title` | 3.0 | Highest signal for relevance |
4141+| `repo_name` | 2.5 | Exact repo lookups should rank first |
4242+| `author_handle` | 2.0 | Author search is common |
4343+| `summary` | 1.5 | More focused than body |
4444+| `tags_json` | 1.2 | Topic matching |
4545+| `body` | 1.0 | Baseline |
4646+4747+### Query Features
4848+4949+Tantivy query syntax is exposed to users:
5050+5151+- Boolean: `go AND search`, `rust NOT unsafe`
5252+- Phrase: `"pull request"`
5353+- Prefix: `tang*`
5454+- Field-specific: `title:parser`
5555+5656+### Snippets
5757+5858+Use `fts_highlight()` to generate highlighted snippets:
5959+6060+```sql
6161+fts_highlight(d.body, '<mark>', '</mark>', ?) AS body_snippet
6262+```
6363+6464+## 3. Semantic Search
6565+6666+### Query Flow
6767+6868+1. Convert user query text to embedding via the configured provider
6969+2. Query `vector_top_k` for nearest neighbors
7070+3. Join back to `documents` to get metadata
7171+4. Filter out deleted/hidden documents
7272+5. Return results with distance as score
7373+7474+```sql
7575+SELECT d.id, d.title, d.summary, d.repo_name, d.author_handle,
7676+ d.collection, d.record_type, d.updated_at
7777+FROM vector_top_k('idx_embeddings_vec', vector32(?), ?) AS v
7878+JOIN document_embeddings e ON e.rowid = v.id
7979+JOIN documents d ON d.id = e.document_id
8080+WHERE d.deleted_at IS NULL;
8181+```
8282+8383+### Score Normalization
8484+8585+Cosine distance ranges from 0 (identical) to 2 (opposite). Normalize to a 0–1 relevance score:
8686+8787+```
8888+semantic_score = 1.0 - (distance / 2.0)
8989+```
9090+9191+## 4. Hybrid Search
9292+9393+### v1: Weighted Score Blending
9494+9595+```
9696+hybrid_score = 0.65 * keyword_score_normalized + 0.35 * semantic_score_normalized
9797+```
9898+9999+### Score Normalization for Blending
100100+101101+Keyword (BM25) scores are unbounded. Normalize using min-max within the result set:
102102+103103+```
104104+keyword_normalized = (score - min_score) / (max_score - min_score)
105105+```
106106+107107+Semantic scores are already bounded after the distance-to-relevance conversion.
108108+109109+### Merge Strategy
110110+111111+1. Fetch top N keyword results (e.g., N=50)
112112+2. Fetch top N semantic results
113113+3. Merge on `document_id`
114114+4. For documents appearing in both sets, combine scores
115115+5. For documents in only one set, use that score (with 0 for the missing signal)
116116+6. Sort by `hybrid_score` descending
117117+7. Deduplicate
118118+8. Apply limit/offset
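The merge strategy can be sketched as follows, with hypothetical types and the v1 weights:

```go
package main

import (
	"fmt"
	"sort"
)

type hit struct {
	ID    string
	Score float64
}

// hybridMerge implements the steps above: min-max normalize the
// unbounded keyword (BM25) scores, merge on document ID with a missing
// signal contributing 0, and sort by the weighted hybrid score.
func hybridMerge(keyword, semantic []hit, kwWeight, semWeight float64) []hit {
	scores := map[string]float64{}
	if len(keyword) > 0 {
		lo, hi := keyword[0].Score, keyword[0].Score
		for _, h := range keyword {
			if h.Score < lo {
				lo = h.Score
			}
			if h.Score > hi {
				hi = h.Score
			}
		}
		for _, h := range keyword {
			norm := 1.0
			if hi > lo {
				norm = (h.Score - lo) / (hi - lo)
			}
			scores[h.ID] += kwWeight * norm
		}
	}
	for _, h := range semantic { // already in [0,1] after conversion
		scores[h.ID] += semWeight * h.Score
	}
	merged := make([]hit, 0, len(scores))
	for id, s := range scores {
		merged = append(merged, hit{ID: id, Score: s})
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i].Score > merged[j].Score })
	return merged
}

func main() {
	kw := []hit{{"doc-a", 12.4}, {"doc-b", 3.1}}
	sem := []hit{{"doc-a", 0.8}, {"doc-c", 0.6}}
	for _, h := range hybridMerge(kw, sem, 0.65, 0.35) {
		fmt.Printf("%s %.3f\n", h.ID, h.Score)
	}
}
```

Merging into a score map deduplicates as a side effect, so steps 3–7 collapse into one pass.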
119119+120120+### v2: Reciprocal Rank Fusion (future)
121121+122122+If keyword and semantic score scales prove unstable under weighted blending, replace with RRF:
123123+124124+```
125125+rrf_score = Σ 1 / (k + rank_i)
126126+```
127127+128128+where `k` is a constant (typically 60) and `rank_i` is the document's rank in each result list.
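A minimal RRF sketch over ranked ID lists:

```go
package main

import (
	"fmt"
	"sort"
)

// rrf fuses ranked ID lists with reciprocal rank fusion:
// score(d) = Σ 1/(k + rank_i), with ranks starting at 1.
func rrf(k float64, lists ...[]string) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for rank, id := range list {
			scores[id] += 1.0 / (k + float64(rank+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	return ids
}

func main() {
	// "glow-rs" ranks 2nd in both lists; the others appear only once.
	fmt.Println(rrf(60, []string{"other", "glow-rs"}, []string{"misc", "glow-rs"}))
}
```

Because only ranks are used, RRF is immune to the scale-mismatch problem that motivates it.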
129129+130130+## 5. Filtering
131131+132132+All search modes support these filters, applied as SQL WHERE clauses:
133133+134134+| Filter | Parameter | SQL |
135135+|--------|-----------|-----|
136136+| Collection | `collection` | `d.collection = ?` |
137137+| Author | `author` | `d.author_handle = ?` or `d.did = ?` |
138138+| Repo | `repo` | `d.repo_name = ?` or `d.repo_did = ?` |
139139+| Record type | `type` | `d.record_type = ?` |
140140+| Language | `language` | `d.language = ?` |
141141+| Date range | `from`, `to` | `d.created_at >= ?` and `d.created_at <= ?` |
142142+| State | `state` | Join to `record_state` table |
143143+144144+## 6. Embedding Eligibility
145145+146146+A document is eligible for embedding if:
147147+148148+- `deleted_at IS NULL`
149149+- `record_type` is one of: `repo`, `issue`, `pull`, `string`, `profile`
150150+- At least one of `title`, `body`, or `summary` is non-empty
151151+- Total text length exceeds a minimum threshold (e.g., 20 characters)
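Those criteria translate to a small predicate; the function name and threshold constant are illustrative:

```go
package main

import "fmt"

var embeddableTypes = map[string]bool{
	"repo": true, "issue": true, "pull": true, "string": true, "profile": true,
}

const minTextLen = 20 // example threshold from the spec

// eligibleForEmbedding applies the four criteria above: not deleted,
// an embeddable record type, some non-empty text, and enough of it.
func eligibleForEmbedding(recordType, title, body, summary string, deleted bool) bool {
	if deleted || !embeddableTypes[recordType] {
		return false
	}
	if title == "" && body == "" && summary == "" {
		return false
	}
	return len(title)+len(body)+len(summary) > minTextLen
}

func main() {
	fmt.Println(eligibleForEmbedding("repo", "glow-rs",
		"A TUI markdown viewer inspired by Glow, written in Rust.", "", false))
}
```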
152152+153153+## 7. API Endpoints
154154+155155+### Health
156156+157157+| Method | Path | Description |
158158+| ------ | ---------- | -------------------------------- |
159159+| GET | `/healthz` | Liveness — process is responsive |
160160+| GET | `/readyz` | Readiness — DB is reachable |
161161+162162+### Search
163163+164164+| Method | Path | Description |
165165+| ------ | ------------------ | ------------------------------------------------ |
166166+| GET | `/search` | Search with configurable mode (default: keyword) |
167167+| GET | `/search/keyword` | Keyword-only search |
168168+| GET | `/search/semantic` | Semantic-only search |
169169+| GET | `/search/hybrid` | Hybrid search |
170170+171171+### Documents
172172+173173+| Method | Path | Description |
174174+| ------ | ----------------- | ----------------------------- |
175175+| GET | `/documents/{id}` | Fetch a single document by ID |
176176+177177+### Admin
178178+179179+| Method | Path | Description |
180180+| ------ | ---------------- | -------------------- |
181181+| POST | `/admin/reindex` | Trigger reindex |
182182+| POST | `/admin/reembed` | Trigger re-embedding |
183183+184184+Admin endpoints are disabled by default. Enable with `ENABLE_ADMIN_ENDPOINTS=true`.
185185+186186+## 8. Query Parameters
187187+188188+| Parameter | Type | Default | Description |
189189+| ------------ | ------ | --------- | -------------------------------------------------------------------- |
190190+| `q` | string | required | Search query |
191191+| `mode` | string | `keyword` | `keyword`, `semantic`, or `hybrid` |
192192+| `limit` | int | 20 | Results per page (max: `SEARCH_MAX_LIMIT`) |
193193+| `offset` | int | 0 | Pagination offset |
194194+| `collection` | string | — | Filter by `sh.tangled.*` collection |
195195+| `type` | string | — | Filter by record type (`repo`, `issue`, `pull`, `string`, `profile`) |
196196+| `author` | string | — | Filter by author handle or DID |
197197+| `repo` | string | — | Filter by repo name or repo DID |
198198+| `language` | string | — | Filter by language |
199199+| `from` | string | — | Created after (ISO 8601) |
200200+| `to` | string | — | Created before (ISO 8601) |
201201+| `state` | string | — | Filter by state (`open`, `closed`, `merged`) |
202202+203203+## 9. Search Response
204204+205205+```json
206206+{
207207+ "query": "rust markdown tui",
208208+ "mode": "hybrid",
209209+ "total": 142,
210210+ "limit": 20,
211211+ "offset": 0,
212212+ "results": [
213213+ {
214214+ "id": "did:plc:abc|sh.tangled.repo|3kb3fge5lm32x",
215215+ "collection": "sh.tangled.repo",
216216+ "record_type": "repo",
217217+ "title": "glow-rs",
218218+ "body_snippet": "A TUI markdown viewer inspired by <mark>Glow</mark>...",
219219+ "summary": "Rust TUI markdown viewer",
220220+ "repo_name": "glow-rs",
221221+ "author_handle": "desertthunder.dev",
222222+ "score": 0.842,
223223+ "matched_by": ["keyword", "semantic"],
224224+ "created_at": "2026-03-20T10:00:00Z",
225225+ "updated_at": "2026-03-22T15:03:11Z"
226226+ }
227227+ ]
228228+}
229229+```
230230+231231+### Result Fields
232232+233233+| Field | Type | Description |
234234+| --------------- | -------- | --------------------------------------- |
235235+| `id` | string | Document stable ID |
236236+| `collection` | string | ATProto collection NSID |
237237+| `record_type` | string | Normalized type label |
238238+| `title` | string | Document title |
239239+| `body_snippet` | string | Highlighted body excerpt |
240240+| `summary` | string | Short description |
241241+| `repo_name` | string | Repository name (if applicable) |
242242+| `author_handle` | string | Author handle |
243243+| `score` | float | Relevance score (0–1) |
244244+| `matched_by` | string[] | Which search modes produced this result |
245245+| `created_at` | string | ISO 8601 creation timestamp |
246246+| `updated_at` | string | ISO 8601 last update timestamp |
247247+248248+## 10. Document Response
249249+250250+`GET /documents/{id}` returns the full document:
251251+252252+```json
253253+{
254254+ "id": "did:plc:abc|sh.tangled.repo|3kb3fge5lm32x",
255255+ "did": "did:plc:abc",
256256+ "collection": "sh.tangled.repo",
257257+ "rkey": "3kb3fge5lm32x",
258258+ "at_uri": "at://did:plc:abc/sh.tangled.repo/3kb3fge5lm32x",
259259+ "cid": "bafyreig...",
260260+ "record_type": "repo",
261261+ "title": "glow-rs",
262262+ "body": "A TUI markdown viewer inspired by Glow, written in Rust.",
263263+ "summary": "Rust TUI markdown viewer",
264264+ "repo_name": "glow-rs",
265265+ "author_handle": "desertthunder.dev",
266266+ "tags_json": "[\"rust\", \"tui\", \"markdown\"]",
267267+ "language": "en",
268268+ "created_at": "2026-03-20T10:00:00Z",
269269+ "updated_at": "2026-03-22T15:03:11Z",
270270+ "indexed_at": "2026-03-22T15:05:00Z",
271271+ "has_embedding": true
272272+}
273273+```
274274+275275+## 11. Error Responses
276276+277277+| Status | Condition |
278278+| ------ | ------------------------------------------------------------------ |
279279+| 400 | Missing `q` parameter, invalid `limit`/`offset`, malformed filters |
280280+| 404 | Document not found |
281281+| 503 | DB unreachable (readiness failure) |
282282+283283+```json
284284+{
285285+ "error": "invalid_parameter",
286286+ "message": "limit must be between 1 and 100"
287287+}
288288+```
289289+290290+## 12. API Behavior
291291+292292+- `keyword` returns only lexical matches via `fts_match`/`fts_score`
293293+- `semantic` returns only embedding-backed matches via `vector_top_k`
294294+- `hybrid` merges both result sets and reranks
295295+- All modes exclude documents with `deleted_at IS NOT NULL` by default
296296+- Pagination uses `limit`/`offset` (cursor-based pagination deferred)
+325
packages/api/docs/specs/06-operations.md
···11+---
22+title: "Spec 06 — Operations"
33+updated: 2026-03-22
44+---
55+66+Covers configuration, observability, security, and deployment.
77+88+## 1. Configuration
99+1010+All configuration is via environment variables.
1111+1212+### Required
1313+1414+| Variable | Description |
1515+| --------------------- | ----------------------------------------------------------- |
1616+| `TAP_URL` | Tap WebSocket URL (e.g., `wss://tap.example.com/channel`) |
1717+| `TAP_AUTH_PASSWORD` | Tap admin password for Basic auth (if set on Tap) |
1818+| `TURSO_DATABASE_URL` | Turso connection URL (e.g., `libsql://db-name.turso.io`) |
1919+| `TURSO_AUTH_TOKEN` | Turso JWT auth token |
2020+| `INDEXED_COLLECTIONS` | Comma-separated list of `sh.tangled.*` collections to index |
2121+2222+### Search
2323+2424+| Variable | Default | Description |
2525+| ---------------------- | --------- | ------------------------ |
2626+| `SEARCH_DEFAULT_LIMIT` | `20` | Default results per page |
2727+| `SEARCH_MAX_LIMIT` | `100` | Maximum results per page |
2828+| `SEARCH_DEFAULT_MODE` | `keyword` | Default search mode |
2929+3030+### Embedding
3131+3232+| Variable | Default | Description |
3333+| ---------------------- | ------- | ---------------------------------------------------- |
3434+| `EMBEDDING_PROVIDER` | — | Provider name (e.g., `openai`, `ollama`, `voyageai`) |
3535+| `EMBEDDING_MODEL` | — | Model name (e.g., `text-embedding-3-small`) |
3636+| `EMBEDDING_API_KEY` | — | Provider API key |
3737+| `EMBEDDING_API_URL` | — | Provider base URL (for self-hosted) |
3838+| `EMBEDDING_DIM` | `768` | Vector dimensionality |
3939+| `EMBEDDING_BATCH_SIZE` | `32` | Batch size for embed-worker |
4040+4141+### Hybrid Search
4242+4343+| Variable | Default | Description |
4444+| ------------------------ | ------- | --------------------------------------- |
4545+| `HYBRID_KEYWORD_WEIGHT` | `0.65` | Keyword score weight in hybrid ranking |
4646+| `HYBRID_SEMANTIC_WEIGHT` | `0.35` | Semantic score weight in hybrid ranking |
4747+4848+### Server
4949+5050+| Variable | Default | Description |
5151+| ------------------------ | ------- | ------------------------------------------- |
5252+| `HTTP_BIND_ADDR` | `:8080` | API server bind address |
5353+| `LOG_LEVEL` | `info` | Log level: `debug`, `info`, `warn`, `error` |
5454+| `LOG_FORMAT` | `json` | Log format: `json` or `text` |
5555+| `ENABLE_ADMIN_ENDPOINTS` | `false` | Enable `/admin/*` endpoints |
5656+| `ADMIN_AUTH_TOKEN` | — | Bearer token for admin endpoints |
5757+5858+### Example `.env`
5959+6060+```bash
6161+# Tap (deployed on Railway)
6262+TAP_URL=wss://tap-instance.up.railway.app/channel
6363+TAP_AUTH_PASSWORD=your-tap-admin-password
6464+6565+# Turso
6666+TURSO_DATABASE_URL=libsql://twister-db.turso.io
6767+TURSO_AUTH_TOKEN=eyJhbGci...
6868+6969+# Collections
7070+INDEXED_COLLECTIONS=sh.tangled.repo,sh.tangled.repo.issue,sh.tangled.repo.pull,sh.tangled.string,sh.tangled.actor.profile,sh.tangled.repo.issue.comment,sh.tangled.repo.pull.comment,sh.tangled.repo.issue.state,sh.tangled.repo.pull.status,sh.tangled.feed.star
7171+7272+# Search
7373+SEARCH_DEFAULT_LIMIT=20
7474+SEARCH_MAX_LIMIT=100
7575+7676+# Embedding (Phase 2)
7777+# EMBEDDING_PROVIDER=openai
7878+# EMBEDDING_MODEL=text-embedding-3-small
7979+# EMBEDDING_API_KEY=sk-...
8080+# EMBEDDING_DIM=768
8181+8282+# Server
8383+HTTP_BIND_ADDR=:8080
8484+LOG_LEVEL=info
8585+ENABLE_ADMIN_ENDPOINTS=false
8686+```
8787+8888+## 2. Observability
8989+9090+### Structured Logging
9191+9292+Use Go's `slog` with JSON output. Every log entry includes:
9393+9494+| Field | Description |
9595+| --------- | ----------------------------------- |
9696+| `ts` | Timestamp (RFC 3339) |
9797+| `level` | Log level |
9898+| `service` | `api`, `indexer`, or `embed-worker` |
9999+| `msg` | Human-readable message |
100100+101101+#### Context Fields (where applicable)
102102+103103+| Field | When |
104104+| ------------- | ------------------------ |
105105+| `event_name` | Tap event processing |
106106+| `event_id` | Tap event ID |
107107+| `document_id` | Document operations |
108108+| `did` | Any DID-scoped operation |
109109+| `collection` | Record processing |
110110+| `rkey` | Record processing |
111111+| `cursor` | Cursor persistence |
112112+| `error_class` | Error handling |
113113+| `duration_ms` | Timed operations |
114114+115115+### Metrics
116116+117117+Recommended counters and gauges (via logs, Prometheus, or platform metrics):
118118+119119+#### Ingestion
120120+121121+| Metric | Type | Description |
122122+| ------------------------------ | --------- | ---------------------------------- |
123123+| `events_processed_total` | counter | Total Tap events processed |
124124+| `events_failed_total` | counter | Events that failed processing |
125125+| `normalization_failures_total` | counter | Normalization errors by collection |
126126+| `upsert_duration_ms` | histogram | DB upsert latency |
127127+| `cursor_position` | gauge | Current Tap cursor position |
128128+129129+#### Embedding
130130+131131+| Metric | Type | Description |
132132+| -------------------------- | --------- | ------------------------------ |
133133+| `embedding_queue_depth` | gauge | Pending embedding jobs |
134134+| `embedding_failures_total` | counter | Failed embedding attempts |
135135+| `embedding_duration_ms` | histogram | Per-document embedding latency |
136136+137137+#### Search
138138+139139+| Metric | Type | Description |
140140+| ----------------------- | --------- | -------------------------- |
141141+| `search_requests_total` | counter | Requests by mode |
142142+| `search_duration_ms` | histogram | Query latency by mode |
143143+| `search_results_count` | histogram | Results returned per query |
144144+145145+### Health Checks
146146+147147+#### API Process
148148+149149+| Endpoint | Check | Healthy |
150150+| -------------- | --------------------- | ------------------- |
151151+| `GET /healthz` | Process is responsive | Always (liveness) |
152152+| `GET /readyz` | DB connection works | `SELECT 1` succeeds |
153153+154154+#### Indexer Process
155155+156156+The indexer exposes its own health probe, separate from the API's HTTP routes:
157157+158158+- Tap WebSocket connected or reconnecting
159159+- Cursor advancing or intentionally idle
160160+- DB reachable
161161+162162+On Railway, this is a health check endpoint on a separate port (9090).
163163+164164+#### Embed Worker
165165+166166+- DB reachable
167167+- Embedding provider reachable (periodic test call)
168168+- Job queue not stalled (jobs processing within expected timeframe)
169169+170170+## 3. Security
171171+172172+### Secrets Management
173173+174174+Secrets are injected through platform secret management:
175175+176176+- **Railway:** Environment variables in the dashboard or `railway variables`
177177+178178+Secrets are never stored in code, config files, or Docker images.
179179+180180+Required secrets:
181181+182182+| Secret | Purpose |
183183+| ------------------- | --------------------------------- |
184184+| `TURSO_AUTH_TOKEN` | Turso database authentication |
185185+| `TAP_AUTH_PASSWORD` | Tap admin API authentication |
186186+| `EMBEDDING_API_KEY` | Embedding provider authentication |
187187+| `ADMIN_AUTH_TOKEN` | Admin endpoint authentication |
188188+189189+### Admin Endpoints
190190+191191+Admin endpoints (`/admin/reindex`, `/admin/reembed`) are:
192192+193193+- Disabled by default (`ENABLE_ADMIN_ENDPOINTS=false`)
194194+- When enabled, protected by bearer token (`ADMIN_AUTH_TOKEN`)
195195+- Alternatively, exposed only on internal networking (Railway private networking)
196196+197197+### Input Validation
198198+199199+The search API shall:
200200+201201+- Validate `limit` is between 1 and `SEARCH_MAX_LIMIT`
202202+- Validate `offset` is non-negative
203203+- Reject unknown or malformed filter parameters with 400
204204+- Sanitize query strings before passing to FTS (Tantivy query parser handles this, but validate basic structure)
205205+- Bound hybrid requests (limit concurrent vector searches)
206206+207207+### Tap Authentication
208208+209209+The indexer authenticates to Tap using HTTP Basic auth (`admin:<TAP_AUTH_PASSWORD>`). The WebSocket upgrade request includes the auth header.
210210+211211+### Data Privacy
212212+213213+- All indexed content is public ATProto data
214214+- No private or authenticated content is ingested
215215+- Deleted records are tombstoned (`deleted_at` set) and excluded from search results
216216+- Tombstoned documents are periodically purged (configurable retention)
217217+218218+## 4. Deployment
219219+220220+### Railway (Primary)
221221+222222+All Twister services deploy as separate Railway services within the same project. Tap is already deployed here.
223223+224224+#### Service Layout
225225+226226+| Service | Start Command | Health Check | Public |
227227+| ------------ | ---------------------- | ------------------ | ------ |
228228+| tap | (already deployed) | `GET /health` | no |
229229+| api | `twister api` | `GET /healthz` | yes |
230230+| indexer | `twister indexer` | `GET :9090/health` | no |
231231+| embed-worker | `twister embed-worker` | `GET :9091/health` | no |
232232+233233+All services share the same Docker image. Railway uses the start command to select the subcommand.
234234+235235+#### Environment Variables
236236+237237+Set per-service in the Railway dashboard or via `railway variables`:
238238+239239+```bash
240240+# Shared across services
241241+TURSO_DATABASE_URL=libsql://twister-db.turso.io
242242+TURSO_AUTH_TOKEN=eyJ...
243243+LOG_LEVEL=info
244244+LOG_FORMAT=json
245245+246246+# API service
247247+HTTP_BIND_ADDR=:8080
248248+SEARCH_DEFAULT_LIMIT=20
249249+SEARCH_MAX_LIMIT=100
250250+ENABLE_ADMIN_ENDPOINTS=false
251251+252252+# Indexer service
253253+TAP_URL=wss://${{tap.RAILWAY_PUBLIC_DOMAIN}}/channel # Railway service reference
254254+TAP_AUTH_PASSWORD=...
255255+INDEXED_COLLECTIONS=sh.tangled.repo,sh.tangled.repo.issue,sh.tangled.repo.pull,sh.tangled.string,sh.tangled.actor.profile
256256+257257+# Embed-worker (Phase 2)
258258+# EMBEDDING_PROVIDER=openai
259259+# EMBEDDING_MODEL=text-embedding-3-small
260260+# EMBEDDING_API_KEY=sk-...
261261+```
262262+263263+Railway supports referencing other services' variables with `${{service.VAR}}` syntax, which is useful for linking the indexer to Tap's domain.
264264+265265+#### Health Checks
266266+267267+Railway activates deployments based on health check responses. Configure per-service:
268268+269269+- **api:** HTTP health check on `/healthz` port 8080
270270+- **indexer:** HTTP health check on `/health` port 9090
271271+- **embed-worker:** HTTP health check on `/health` port 9091
272272+273273+#### Autodeploy
274274+275275+Connect the GitHub repository for automatic deployments on push. Railway builds from the Dockerfile and uses the start command configured per service.
276276+277277+#### Internal Networking
278278+279279+Railway services within the same project can communicate over private networking using `service.railway.internal` hostnames. The indexer connects to Tap via this internal network when both are in the same project.
280280+281281+### Dockerfile
282282+283283+```dockerfile
284284+FROM golang:1.24-alpine AS builder
285285+286286+WORKDIR /app
287287+288288+COPY go.mod go.sum ./
289289+RUN go mod download
290290+291291+COPY . .
292292+293293+RUN CGO_ENABLED=0 GOOS=linux go build \
294294+ -ldflags="-s -w" \
295295+ -o /app/twister \
296296+ ./main.go
297297+298298+FROM alpine:3.21
299299+300300+RUN apk add --no-cache ca-certificates tzdata
301301+302302+COPY --from=builder /app/twister /usr/local/bin/twister
303303+304304+EXPOSE 8080 9090 9091
305305+306306+CMD ["twister", "api"]
307307+```
308308+309309+Notes:
310310+311311+- `CGO_ENABLED=0` produces a static binary (works with the pure-Go `libsql-client-go`; not compatible with `go-libsql`, which requires CGo)
312312+- Railway overrides `CMD` with the start command configured per service
313313+- Multiple ports exposed: 8080 (API), 9090 (indexer health), 9091 (embed-worker health)
314314+315315+### Graceful Shutdown
316316+317317+All processes handle `SIGTERM` and `SIGINT`:
318318+319319+1. Stop accepting new requests/events
320320+2. Drain in-flight work (with timeout)
321321+3. Persist current cursor (indexer)
322322+4. Close DB connections
323323+5. Exit 0
324324+325325+Railway sends `SIGTERM` during deployments and restarts.
+137
packages/api/docs/specs/07-graph-backfill.md
···11+---
22+title: "Spec 07 — Graph Backfill"
33+updated: 2026-03-22
44+---
55+66+## 1. Purpose
77+88+Bootstrap the search index with existing Tangled content by discovering users from a seed set and triggering Tap backfill for their repositories. Without this, the index only captures new events after deployment.
99+1010+## 2. Seed Set
1111+1212+A manually curated list of known Tangled users (DIDs or handles), stored in a plain text file:
1313+1414+```text
1515+# Known active Tangled users
1616+did:plc:abc123
1717+did:plc:def456
1818+alice.tangled.sh
1919+bob.tangled.sh
2020+# Add more as discovered
2121+```
2222+2323+Format:
2424+- One entry per line
2525+- Lines starting with `#` are comments
2626+- Blank lines are ignored
2727+- Entries can be DIDs (`did:plc:...`) or handles (`alice.tangled.sh`)
2828+- Handles are resolved to DIDs before processing
2929+3030+## 3. Fan-Out Strategy
3131+3232+From each seed user, discover connected users to expand the crawl set:
3333+3434+### Discovery Sources
3535+3636+1. **Follows**: Fetch `sh.tangled.graph.follow` records for the user → extract `subject` DIDs
3737+2. **Collaborators**: For repos owned by the user, identify other users who have created issues, PRs, or comments → extract their DIDs
3838+3939+### Depth Limit
4040+4141+Fan-out is configurable with a max hops parameter (default: 2):
4242+4343+- **Hop 0**: Seed users themselves
4444+- **Hop 1**: Direct follows and collaborators of seed users
4545+- **Hop 2**: Follows and collaborators of hop-1 users
4646+4747+Higher hop counts discover more users but increase crawl time and may pull in loosely related accounts. Start with 2 hops and adjust based on the size of the Tangled network.
4848+4949+### Crawl Queue
5050+5151+Discovered DIDs are added to a queue, deduplicated by DID. Each entry tracks:
5252+- DID
5353+- Discovery hop (distance from seed)
5454+- Source (which seed/user led to discovery)
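The fan-out above reduces to a breadth-first walk with a visited set. In this sketch, `discover` stands in for the follows/collaborator lookups:

```go
package main

import "fmt"

// crawlEntry mirrors the queue fields above: DID, discovery hop, and source.
type crawlEntry struct {
	DID    string
	Hop    int
	Source string
}

// fanOut walks the graph breadth-first up to maxHops, deduplicating by DID.
// discover stands in for fetching a user's follows and collaborators.
func fanOut(seeds []string, maxHops int, discover func(did string) []string) []crawlEntry {
	visited := map[string]bool{}
	var out, frontier []crawlEntry
	for _, s := range seeds {
		frontier = append(frontier, crawlEntry{DID: s, Hop: 0, Source: s})
	}
	for hop := 0; hop <= maxHops; hop++ {
		var next []crawlEntry
		for _, e := range frontier {
			if visited[e.DID] {
				continue // user-level dedup
			}
			visited[e.DID] = true
			out = append(out, e)
			if hop < maxHops {
				for _, d := range discover(e.DID) {
					next = append(next, crawlEntry{DID: d, Hop: hop + 1, Source: e.DID})
				}
			}
		}
		frontier = next
	}
	return out
}

func main() {
	graph := map[string][]string{"did:a": {"did:b"}, "did:b": {"did:c"}}
	entries := fanOut([]string{"did:a"}, 2, func(d string) []string { return graph[d] })
	fmt.Println(len(entries)) // 3: seed plus two hops
}
```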
5555+5656+## 4. Backfill Mechanism
5757+5858+For each discovered user:
5959+6060+1. **Check if already tracked**: Query Tap's `/info/:did` endpoint — if the repo is already tracked and backfilled, skip
6161+2. **Register with Tap**: POST to `/repos/add` with the DID — Tap handles the actual repo export and event delivery
6262+3. **Tap backfill flow**: Tap fetches full repo history from PDS via `com.atproto.sync.getRepo`, then delivers historical events (`live: false`) through the normal WebSocket channel
6363+4. **Indexer processes normally**: The indexer's existing ingestion loop handles backfill events the same as live events — no special backfill code path needed
6464+6565+### Rate Limiting
6666+6767+- Batch `/repos/add` calls (e.g., 10 DIDs per request)
6868+- Add configurable delay between batches to avoid overwhelming Tap
6969+- Respect Tap's processing capacity — monitor `/stats/repo-count` to track progress
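The batching and delay policy can be sketched as follows; the `submit` callback stands in for the actual POST to Tap's `/repos/add` (whose request body shape is not assumed here):

```go
package main

import (
	"fmt"
	"time"
)

// submitBatches splits DIDs into fixed-size batches and calls submit for
// each, sleeping between batches so Tap is not overwhelmed.
func submitBatches(dids []string, size int, delay time.Duration, submit func([]string) error) error {
	for i := 0; i < len(dids); i += size {
		end := i + size
		if end > len(dids) {
			end = len(dids)
		}
		if err := submit(dids[i:end]); err != nil {
			return fmt.Errorf("batch %d: %w", i/size, err)
		}
		if end < len(dids) {
			time.Sleep(delay) // configurable --batch-delay
		}
	}
	return nil
}

func main() {
	var calls int
	_ = submitBatches([]string{"a", "b", "c"}, 2, 0, func(b []string) error {
		calls++
		return nil
	})
	fmt.Println(calls) // 2 batches for 3 DIDs at size 2
}
```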
7070+7171+## 5. Deduplication
7272+7373+- **User-level**: Maintain a visited set of DIDs during fan-out; skip already-seen DIDs
7474+- **Tap-level**: Tap's `/repos/add` is idempotent — adding an already-tracked DID is a no-op
7575+- **Record-level**: The indexer's upsert logic (keyed on `did|collection|rkey`) handles duplicate events naturally
7676+7777+## 6. CLI Interface
7878+7979+```bash
8080+# Basic backfill from seed file
8181+twister backfill --seeds seeds.txt
8282+8383+# Limit fan-out depth
8484+twister backfill --seeds seeds.txt --max-hops 1
8585+8686+# Preview discovered users without triggering backfill
8787+twister backfill --seeds seeds.txt --dry-run
8888+8989+# Control parallelism
9090+twister backfill --seeds seeds.txt --concurrency 5
9191+```
9292+9393+### Flags
9494+9595+| Flag | Default | Description |
9696+|------|---------|-------------|
9797+| `--seeds` | required | Path to seed file |
9898+| `--max-hops` | `2` | Max fan-out depth from seed users |
9999+| `--dry-run` | `false` | List discovered users without submitting to Tap |
100100+| `--concurrency` | `5` | Parallel discovery workers |
101101+| `--batch-size` | `10` | DIDs per `/repos/add` call |
102102+| `--batch-delay` | `1s` | Delay between batches |
103103+104104+### Output
105105+106106+Progress is logged to stdout:
107107+108108+```text
109109+[hop 0] Processing 5 seed users...
110110+[hop 0] did:plc:abc123 → 12 follows, 3 collaborators
111111+[hop 0] did:plc:def456 → 8 follows, 1 collaborator
112112+[hop 1] Processing 24 discovered users (18 new)...
113113+...
114114+[done] Discovered 142 unique users across 2 hops
115115+[done] Submitted 98 new DIDs to Tap (44 already tracked)
116116+```
117117+118118+## 7. Idempotency
119119+120120+The entire backfill process is safe to re-run:
121121+122122+- Seed file parsing is stateless
123123+- Fan-out discovery is deterministic for a given network state
124124+- Tap's `/repos/add` is idempotent
125125+- The indexer's upsert logic handles re-delivered events
126126+- No local state is persisted between runs (the crawl queue is in-memory)
127127+128128+## 8. Configuration
129129+130130+| Variable | Default | Description |
131131+|----------|---------|-------------|
132132+| `TAP_URL` | (existing) | Tap base URL for API calls |
133133+| `TAP_AUTH_PASSWORD` | (existing) | Tap admin auth |
134134+| `TURSO_DATABASE_URL` | (existing) | For checking existing records |
135135+| `TURSO_AUTH_TOKEN` | (existing) | DB auth |
136136+137137+No new environment variables are needed — backfill reuses existing Tap and DB configuration.
+21
packages/api/docs/specs/README.md
···11+---
22+title: "Twister — Technical Specification Index"
33+updated: 2026-03-22
44+---
55+66+# Twister Technical Specifications
77+88+Twister is a Go-based search service for [Tangled](https://tangled.org) content on AT Protocol.
99+It ingests records through [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap), denormalizes them into search documents, indexes them in [Turso/libSQL](https://docs.turso.tech), and exposes keyword, semantic, and hybrid search APIs.
1010+1111+## Specifications
1212+1313+| # | Document | Description |
1414+|---|----------|-------------|
1515+| 1 | [Architecture](01-architecture.md) | Purpose, goals, design principles, system context, tech choices |
1616+| 2 | [Tangled Lexicons](02-tangled-lexicons.md) | `sh.tangled.*` record schemas and fields |
1717+| 3 | [Data Model](03-data-model.md) | Database schema, search documents, sync state |
1818+| 4 | [Data Pipeline](04-data-pipeline.md) | Tap integration, normalization, failure handling |
1919+| 5 | [Search](05-search.md) | Search modes, API contract, scoring, filtering |
2020+| 6 | [Operations](06-operations.md) | Configuration, observability, security, deployment |
2121+| 7 | [Graph Backfill](07-graph-backfill.md) | Seed-based user discovery and content backfill |
+38
packages/api/docs/tasks/README.md
···11+---
22+title: "Twister — Task Index"
33+updated: 2026-03-22
44+---
55+66+# Twister Tasks
77+88+Assumes Go, Tap (deployed on Railway), Turso/libSQL, and Railway for deployment.
99+1010+## Delivery Strategy
1111+1212+Build in four phases:
1313+1414+1. **MVP** — ingestion, keyword search, deployment, operational tooling, graph backfill
1515+2. **Semantic Search** — embeddings, vector retrieval
1616+3. **Hybrid Search** — weighted merge of keyword + semantic
1717+4. **Quality Polish** — ranking refinement, advanced filters, analytics
1818+1919+Ship keyword search before embeddings. That gives a testable, inspectable baseline before introducing model behavior.
2020+2121+## Phases
2222+2323+| Phase | Title | Document | Status |
2424+| ----- | ----- | -------- | ------ |
2525+| 1 | MVP | [phase-1-mvp.md](phase-1-mvp.md) | In progress (M0–M2 complete) |
2626+| 2 | Semantic Search | [phase-2-semantic.md](phase-2-semantic.md) | Not started |
2727+| 3 | Hybrid Search | [phase-3-hybrid.md](phase-3-hybrid.md) | Not started |
2828+| 4 | Quality Polish | [phase-4-quality.md](phase-4-quality.md) | Not started |
2929+3030+## MVP Complete When
3131+3232+- Tap ingests tracked `sh.tangled.*` records
3333+- Documents normalize into a stable store
3434+- Keyword search works publicly
3535+- API and indexer are deployed on Railway
3636+- Restart does not lose sync position
3737+- Reindex exists for repair
3838+- Graph backfill populates initial content from seed users
+407
packages/api/docs/tasks/phase-1-mvp.md
···11+---
22+title: "Phase 1 — MVP"
33+updated: 2026-03-22
44+---
55+66+# Phase 1 — MVP
77+88+Get a searchable product online: ingestion, keyword search, deployment, and operational tooling.
99+1010+## MVP Complete When
1111+1212+- Tap ingests tracked `sh.tangled.*` records
1313+- Documents normalize into a stable store
1414+- Keyword search works publicly
1515+- API and indexer are deployed on Railway
1616+- Restart does not lose sync position
1717+- Reindex exists for repair
1818+- Graph backfill populates initial content from seed users
1919+2020+---
2121+2222+## M0 — Repository Bootstrap ✅
2323+2424+Executable layout, local tooling, and development conventions (completed 2026-03-22).
2525+2626+---
2727+2828+## M1 — Database Schema and Store Layer ✅
2929+3030+refs: [specs/03-data-model.md](../specs/03-data-model.md)
3131+3232+Implemented the Turso/libSQL schema and Go store package for document persistence.
3333+3434+---
3535+3636+## M2 — Normalization Layer ✅
3737+3838+refs: [specs/02-tangled-lexicons.md](../specs/02-tangled-lexicons.md), [specs/04-data-pipeline.md](../specs/04-data-pipeline.md)
3939+4040+Translate `sh.tangled.*` records into internal search documents.
4141+4242+---
4343+4444+## M3 — Tap Client and Ingestion Loop
4545+4646+refs: [specs/04-data-pipeline.md](../specs/04-data-pipeline.md), [specs/01-architecture.md](../specs/01-architecture.md)
4747+4848+### Goal
4949+5050+Connect the indexer to Tap (on Railway) and process live events into the store.
5151+5252+### Why Now
5353+5454+Tap is the point of truth for synchronized ATProto ingestion. It is already deployed on Railway.
5555+5656+### Deliverables
5757+5858+- Tap WebSocket client package (`internal/tapclient/`)
5959+- Event decode layer (record events + identity events)
6060+- Ingestion loop with retry/backoff
6161+- Cursor persistence coupled to successful DB commits
6262+- Identity event handler (DID → handle cache)
6363+6464+### Tasks
6565+6666+- [ ] Define Tap event DTOs matching the documented event shape:
6767+6868+ ```go
6969+ type TapEvent struct {
7070+ ID int64 `json:"id"`
7171+ Type string `json:"type"` // "record" or "identity"
7272+ Record *TapRecord `json:"record"`
7373+ Identity *TapIdentity `json:"identity"`
7474+ }
7575+ type TapRecord struct {
7676+ Live bool `json:"live"`
7777+ Rev string `json:"rev"`
7878+ DID string `json:"did"`
7979+ Collection string `json:"collection"`
8080+ RKey string `json:"rkey"`
8181+ Action string `json:"action"` // "create", "update", "delete"
8282+ CID string `json:"cid"`
8383+ Record json.RawMessage `json:"record"`
8484+ }
8585+ type TapIdentity struct {
8686+ DID string `json:"did"`
8787+ Handle string `json:"handle"`
8888+ IsActive bool `json:"isActive"`
8989+ Status string `json:"status"`
9090+ }
9191+ ```
9292+9393+- [ ] Implement WebSocket client:
9494+ - Connect to `TAP_URL` (e.g., `wss://tap.railway.internal/channel`)
9595+ - HTTP Basic auth with `admin:TAP_AUTH_PASSWORD`
9696+ - Auto-reconnect with exponential backoff
9797+ - Ack protocol: send event `id` back after successful processing
9898+- [ ] Implement ingestion loop:
9999+ 1. Receive event from WebSocket
100100+ 2. If `type == "identity"` → update handle cache, ack, continue
101101+ 3. If `type == "record"` → check collection allowlist
102102+ 4. Map `action` to operation (create/update → upsert, delete → tombstone)
103103+ 5. Decode `record.record` via adapter registry
104104+ 6. Normalize to `Document`
105105+ 7. Upsert to store
106106+ 8. Schedule embedding job if eligible (Phase 2)
107107+ 9. Persist cursor (event ID) after successful DB commit
108108+ 10. Ack the event
109109+- [ ] Implement collection allowlist from `INDEXED_COLLECTIONS` config
110110+- [ ] Handle state events (`sh.tangled.repo.issue.state`, `sh.tangled.repo.pull.status`) → update `record_state`
111111+- [ ] Handle normalization failures: log, skip, advance cursor
112112+- [ ] Handle DB failures: retry with backoff, do not advance cursor
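The allowlist check and action mapping (steps 3-4 above) can be isolated as a pure dispatch function; `recordEvent` here is a minimal stand-in for the full `TapRecord` DTO:

```go
package main

import "fmt"

// recordEvent mirrors only the fields dispatch needs from TapRecord.
type recordEvent struct {
	Collection string
	Action     string // "create", "update", "delete"
}

// dispatch maps an event to a store operation: skip unlisted collections,
// upsert on create/update, tombstone on delete.
func dispatch(ev recordEvent, allowed map[string]bool) string {
	if !allowed[ev.Collection] {
		return "skip"
	}
	switch ev.Action {
	case "create", "update":
		return "upsert"
	case "delete":
		return "tombstone"
	default:
		return "skip" // unknown action: log, ack, advance cursor
	}
}

func main() {
	allowed := map[string]bool{"sh.tangled.repo": true}
	fmt.Println(dispatch(recordEvent{"sh.tangled.repo", "create"}, allowed))    // upsert
	fmt.Println(dispatch(recordEvent{"sh.tangled.repo", "delete"}, allowed))    // tombstone
	fmt.Println(dispatch(recordEvent{"app.bsky.feed.post", "create"}, allowed)) // skip
}
```

Keeping the mapping pure makes the "unsupported collections are silently skipped" verification item a unit test rather than an integration test.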
113113+114114+### Verification
115115+116116+- [ ] Indexer connects to Tap via WebSocket in development
117117+- [ ] A newly created tracked record appears in `documents` table
118118+- [ ] An updated record changes the existing row (CID changes)
119119+- [ ] A delete event tombstones the row (`deleted_at` set)
120120+- [ ] Killing and restarting the indexer resumes from persisted cursor without duplication
121121+- [ ] Identity events update handle cache
122122+- [ ] Unsupported collections are silently skipped
123123+- [ ] Connection drops trigger automatic reconnection
124124+125125+### Exit Criteria
126126+127127+The system continuously ingests and persists `sh.tangled.*` records from Tap.
128128+129129+---
130130+131131+## M4 — Keyword Search API
132132+133133+refs: [specs/05-search.md](../specs/05-search.md)
134134+135135+### Goal
136136+137137+Expose a usable public search API backed by Turso's Tantivy-based FTS.
138138+139139+### Why Now
140140+141141+First real product milestone. Searchable Tangled content without waiting for embeddings.
142142+143143+### Deliverables
144144+145145+- HTTP server (chi or net/http)
146146+- `GET /healthz` — liveness
147147+- `GET /readyz` — readiness (DB connectivity)
148148+- `GET /search` — keyword search with configurable mode
149149+- `GET /search/keyword` — keyword-only search
150150+- `GET /documents/{id}` — document lookup
151151+- Search repository layer (FTS queries isolated from handlers)
152152+- Pagination, filtering, snippets
153153+154154+### Tasks
155155+156156+- [ ] Set up HTTP server with chi router
157157+- [ ] Implement `/healthz` (always 200) and `/readyz` (SELECT 1 against DB)
158158+- [ ] Implement search repository with FTS queries:
159159+160160+ ```sql
161161+ SELECT id, title, summary, repo_name, author_handle, collection, record_type,
162162+ created_at, updated_at,
163163+ fts_score(title, body, summary, repo_name, author_handle, tags_json, ?) AS score,
164164+ fts_highlight(body, '<mark>', '</mark>', ?) AS body_snippet
165165+ FROM documents
166166+ WHERE fts_match(title, body, summary, repo_name, author_handle, tags_json, ?)
167167+ AND deleted_at IS NULL
168168+ ORDER BY score DESC
169169+ LIMIT ? OFFSET ?;
170170+ ```
171171+172172+- [ ] Implement request validation:
173173+ - `q` required, non-empty
174174+ - `limit` 1–100, default 20
175175+ - `offset` >= 0, default 0
176176+ - Reject unknown parameters with 400
177177+- [ ] Implement filters (as WHERE clauses):
178178+ - `collection` → `d.collection = ?`
179179+ - `type` → `d.record_type = ?`
180180+ - `author` → `d.author_handle = ?` or `d.did = ?`
181181+ - `repo` → `d.repo_name = ?`
182182+- [ ] Implement `/documents/{id}` — full document response
183183+- [ ] Implement stable JSON response contract (see spec 05-search.md)
184184+- [ ] Exclude tombstoned documents (`deleted_at IS NOT NULL`) by default
185185+- [ ] Add request logging middleware (method, path, status, duration)
186186+- [ ] Add CORS headers if needed
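The validation rules can be sketched as a small helper over raw query-string values; `validateSearch` is an illustrative name, not an existing function:

```go
package main

import (
	"fmt"
	"strconv"
)

// validateSearch applies the rules above: q required and non-empty,
// limit 1-100 (default 20), offset >= 0 (default 0).
func validateSearch(q, limitStr, offsetStr string) (limit, offset int, err error) {
	if q == "" {
		return 0, 0, fmt.Errorf("q is required")
	}
	limit = 20
	if limitStr != "" {
		limit, err = strconv.Atoi(limitStr)
		if err != nil || limit < 1 || limit > 100 {
			return 0, 0, fmt.Errorf("limit must be between 1 and 100")
		}
	}
	if offsetStr != "" {
		offset, err = strconv.Atoi(offsetStr)
		if err != nil || offset < 0 {
			return 0, 0, fmt.Errorf("offset must be >= 0")
		}
	}
	return limit, offset, nil
}

func main() {
	l, o, err := validateSearch("tangled", "", "")
	fmt.Println(l, o, err) // 20 0 <nil>
}
```

A handler would translate any returned error into a 400 with the error-JSON contract from spec 05.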
187187+188188+### Verification
189189+190190+- [ ] Searching by exact repo name returns the expected repo first
191191+- [ ] Searching by title term returns expected documents
192192+- [ ] Searching by author handle returns relevant docs
193193+- [ ] Tombstoned documents do not appear
194194+- [ ] Malformed query parameters return 400 with error JSON
195195+- [ ] DB outage causes `/readyz` to fail (503)
196196+- [ ] Pagination works: `offset=0&limit=5` then `offset=5&limit=5` returns different results
197197+- [ ] Filter by collection returns only matching docs
198198+199199+### Exit Criteria
200200+201201+A user can search Tangled content reliably with keyword search.
202202+203203+---
204204+205205+## M5 — Railway Deployment
206206+207207+refs: [specs/06-operations.md](../specs/06-operations.md)
208208+209209+### Goal
210210+211211+Deploy the API and indexer as Railway services alongside Tap.
212212+213213+### Why Now
214214+215215+At this point, the product is useful enough to run continuously.
216216+217217+### Deliverables
218218+219219+- Finalized Dockerfile
220220+- Railway project with services: `api`, `indexer`
221221+- Health checks configured per service
222222+- Secrets/env vars set
223223+- Production startup commands documented
224224+225225+### Tasks
226226+227227+- [ ] Finalize Dockerfile (multi-stage, CGO_ENABLED=0, Alpine runtime)
228228+- [ ] Create Railway services:
229229+ - `api` — start command: `twister api`
230230+ - `indexer` — start command: `twister indexer`
231231+- [ ] Configure environment variables per service:
232232+ - Shared: `TURSO_DATABASE_URL`, `TURSO_AUTH_TOKEN`, `LOG_LEVEL`, `LOG_FORMAT`
233233+ - API: `HTTP_BIND_ADDR`, `SEARCH_DEFAULT_LIMIT`, `SEARCH_MAX_LIMIT`
234234+ - Indexer: `TAP_URL` (reference Tap service domain), `TAP_AUTH_PASSWORD`, `INDEXED_COLLECTIONS`
235235+- [ ] Configure health checks:
236236+ - API: HTTP check on `/healthz` port 8080
237237+ - Indexer: HTTP check on `/health` port 9090
238238+- [ ] Use Railway internal networking for indexer → Tap connection
239239+- [ ] Connect GitHub repo for autodeploy
240240+- [ ] Test graceful shutdown on redeploy (SIGTERM handling)
241241+- [ ] Document deploy steps
242242+243243+### Verification
244244+245245+- [ ] API service becomes healthy and routable (public URL)
246246+- [ ] Indexer service starts and stays healthy
247247+- [ ] A new Tangled record ingested post-deploy becomes searchable
248248+- [ ] A redeploy preserves API availability
249249+- [ ] A restart does not lose sync position (cursor persisted)
250250+- [ ] Health checks correctly report status
251251+252252+### Exit Criteria
253253+254254+The system runs as a deployed service with health-checked processes on Railway.
255255+256256+---
257257+258258+## M6 — Reindex and Repair
259259+260260+refs: [specs/05-search.md](../specs/05-search.md)
261261+262262+### Goal
263263+264264+Make the system recoverable and operable with repair tools.
265265+266266+### Why Now
267267+268268+Search systems are never perfect on first ingestion. Repair tools are needed before production.
269269+270270+### Deliverables
271271+272272+- `twister reindex` command with scoping options
273273+- Dry-run mode
274274+- Admin reindex endpoint (optional)
275275+- Progress logging and error summary
276276+277277+### Tasks
278278+279279+- [ ] Implement `reindex` subcommand with flags:
280280+ - `--collection` — reindex one collection
281281+ - `--did` — reindex one DID's documents
282282+ - `--document` — reindex one document by ID
283283+ - `--dry-run` — show intended work without writes
284284+ - No flags → reindex all
285285+- [ ] Implement reindex logic:
286286+ 1. Select documents matching scope
287287+ 2. For each document, re-run normalization from stored fields (or re-fetch if the source record is available)
288288+ 3. Update FTS-relevant fields
289289+ 4. Upsert back to store
290290+ 5. Log progress (N/total, errors)
291291+- [ ] Implement `POST /admin/reindex` endpoint (behind `ENABLE_ADMIN_ENDPOINTS` + `ADMIN_AUTH_TOKEN`)
292292+- [ ] Add error summary output on completion
293293+- [ ] Exit non-zero on unrecoverable failures
294294+295295+### Verification
296296+297297+- [ ] Reindexing one document updates its stored normalized text
298298+- [ ] Reindexing one collection repairs intentionally corrupted rows
299299+- [ ] Dry-run shows intended work without writes
300300+- [ ] Reindex command exits non-zero on failures
301301+- [ ] Admin endpoint triggers reindex when enabled
302302+303303+### Exit Criteria
304304+305305+Operators can repair bad indexes without rebuilding everything manually.
306306+307307+---
308308+309309+## M7 — Observability
310310+311311+refs: [specs/06-operations.md](../specs/06-operations.md)
312312+313313+### Goal
314314+315315+Make the system diagnosable in production.
316316+317317+### Deliverables
318318+319319+- Structured slog fields across all services
320320+- Error classification
321321+- Ingestion lag visibility
322322+- Periodic state logs
323323+- Operator documentation
324324+325325+### Tasks
326326+327327+- [ ] Standardize slog fields across all packages:
328328+ - `service`, `event_name`, `event_id`, `did`, `collection`, `rkey`, `document_id`, `cursor`, `error_class`, `duration_ms`
329329+- [ ] Add error classification (normalize_error, db_error, tap_error, embed_error)
330330+- [ ] Add periodic state logs in indexer:
331331+ - Current cursor position
332332+ - Events processed since last log
333333+ - Documents in store (count)
334334+- [ ] Add request logging in API (method, path, status, duration, query)
335335+- [ ] Add search latency logging per query mode
336336+- [ ] Write operator documentation:
337337+ - Restart procedure
338338+ - Reindex procedure
339339+ - Backfill notes
340340+ - Failure triage guide
341341+342342+### Verification
343343+344344+- [ ] A failed Tap decode surfaces enough context to debug (collection, DID, rkey, error class)
345345+- [ ] DB connectivity failures are visible in logs and readiness
346346+- [ ] Operator can follow the runbook to diagnose a broken indexer
347347+- [ ] Search latency is logged per request
348348+349349+### Exit Criteria
350350+351351+The system is maintainable without guesswork.
352352+353353+---
354354+355355+## M-New — Graph Backfill from Seed Users
356356+357357+refs: [specs/07-graph-backfill.md](../specs/07-graph-backfill.md)
358358+359359+### Goal
360360+361361+Bootstrap the search index with existing Tangled content by discovering and backfilling users from a seed set.
362362+363363+### Why Now
364364+365365+Before MVP launch, the index needs existing content. Live ingestion only captures new events — backfill populates historical data.
366366+367367+### Deliverables
368368+369369+- `twister backfill` CLI command
370370+- Seed file parser
371371+- Graph fan-out discovery (follows/collaborators)
372372+- Tap `/repos/add` integration for discovered users
373373+- Deduplication against already-indexed users
374374+- Progress logging
375375+376376+### Tasks
377377+378378+- [ ] Implement `backfill` subcommand with flags:
379379+ - `--seeds <file>` — path to seed file (one DID or handle per line)
380380+ - `--max-hops <n>` — depth limit for fan-out (default: 2)
381381+ - `--dry-run` — show discovered users without triggering backfill
382382+ - `--concurrency <n>` — parallel discovery workers (default: 5)
383383+- [ ] Implement seed file parser (supports DIDs and handles, comments with `#`)
384384+- [ ] Implement graph fan-out:
385385+ 1. For each seed user, resolve DID if handle provided
386386+ 2. Fetch `sh.tangled.graph.follow` records for the user
387387+ 3. Fetch collaborators from repos owned by the user
388388+ 4. Add discovered DIDs to the crawl queue
389389+ 5. Repeat up to `max-hops` depth
390390+- [ ] Integrate with Tap `/repos/add` to register discovered DIDs for tracking
391391+- [ ] Deduplicate: skip DIDs already tracked by Tap (check via `/info/:did`)
392392+- [ ] Log progress: seeds processed, users discovered per hop, DIDs submitted to Tap
393393+- [ ] Handle rate limiting and errors gracefully (retry with backoff)
394394+- [ ] Make idempotent: safe to re-run; Tap handles duplicate `/repos/add` calls
395395+396396+### Verification
397397+398398+- [ ] Running with a seed file of 3 known users discovers their followers
399399+- [ ] `--max-hops 1` limits discovery to direct connections only
400400+- [ ] `--dry-run` lists discovered DIDs without calling Tap
401401+- [ ] Already-tracked users are skipped
402402+- [ ] Re-running the same seed file produces no duplicate work
403403+- [ ] Tap begins backfilling records for newly added DIDs
404404+405405+### Exit Criteria
406406+407407+The index contains historical content from the seed user graph, not just new events.
+124
packages/api/docs/tasks/phase-2-semantic.md
···11+---
22+title: "Phase 2 — Semantic Search"
33+updated: 2026-03-22
44+---
55+66+# Phase 2 — Semantic Search
77+88+Add embedding generation and vector-based retrieval on top of the keyword baseline.
99+1010+---
1111+1212+## M8 — Embedding Pipeline
1313+1414+refs: [specs/03-data-model.md](../specs/03-data-model.md), [specs/05-search.md](../specs/05-search.md)
1515+1616+### Goal
1717+1818+Add asynchronous embedding generation without blocking ingestion.
1919+2020+### Why Now
2121+2222+Only after keyword search is stable should semantic complexity be added.
2323+2424+### Deliverables
2525+2626+- `embedding_jobs` table operational (schema from M1)
2727+- `embed-worker` subcommand
2828+- Embedding provider abstraction (OpenAI, Voyage, Ollama)
2929+- Retry and dead-letter behavior
3030+- `twister reembed` command
3131+3232+### Tasks
3333+3434+- [ ] Define embedding provider interface:
3535+3636+ ```go
3737+ type EmbeddingProvider interface {
3838+ Embed(ctx context.Context, texts []string) ([][]float32, error)
3939+ Model() string
4040+ Dimension() int
4141+ }
4242+ ```
4343+4444+- [ ] Implement OpenAI provider (or preferred provider)
4545+- [ ] Implement embedding input text composition (see spec 04-data-pipeline.md, section 5):
4646+ `title\nrepo_name\nauthor_handle\ntags\nsummary\nbody`
4747+- [ ] Add job enqueueing: on document upsert, insert `embedding_jobs` row with `status=pending`
4848+- [ ] Implement `embed-worker` loop:
4949+ 1. Poll for `pending` jobs (batch by `EMBEDDING_BATCH_SIZE`)
5050+ 2. Compose input text per document
5151+ 3. Call embedding provider
5252+ 4. Store vectors in `document_embeddings` with `vector32(?)`
5353+ 5. Mark job `completed`
5454+ 6. On failure: increment `attempts`, set `last_error`, backoff
5555+ 7. After max attempts: mark `dead`
5656+- [ ] Create DiskANN vector index: `CREATE INDEX idx_embeddings_vec ON document_embeddings(libsql_vector_idx(embedding, 'metric=cosine'))`
5757+- [ ] Implement `reembed` command (re-generate all embeddings, useful for model migration)
5858+- [ ] Skip deleted documents in embedding pipeline
5959+- [ ] Add health check endpoint for embed-worker (port 9091)
6060+6161+### Verification
6262+6363+- [ ] Creating a new searchable document enqueues an embedding job
6464+- [ ] Worker processes the job and stores a vector in `document_embeddings`
6565+- [ ] Failed embedding calls retry with bounded attempts
6666+- [ ] Keyword search still works when embed-worker is down
6767+- [ ] `reembed` regenerates embeddings for all eligible documents
6868+6969+### Exit Criteria
7070+7171+Embeddings are produced asynchronously and stored durably.
7272+7373+---
7474+7575+## M9 — Semantic Search
7676+7777+refs: [specs/05-search.md](../specs/05-search.md)
7878+7979+### Goal
8080+8181+Expose vector-based semantic retrieval.
8282+8383+### Why Now
8484+8585+Natural next step once embeddings exist. Turso/libSQL has native vector search with `vector_top_k`.
8686+8787+### Deliverables
8888+8989+- `GET /search/semantic` endpoint
9090+- Query-time embedding (convert query text → vector)
9191+- Vector similarity search via `vector_top_k`
9292+- Response parity with keyword search
9393+9494+### Tasks
9595+9696+- [ ] Implement query embedding: call embedding provider with user's query text
9797+- [ ] Implement semantic search repository:
9898+9999+ ```sql
100100+ SELECT d.id, d.title, d.summary, d.repo_name, d.author_handle,
101101+ d.collection, d.record_type, d.created_at, d.updated_at
102102+ FROM vector_top_k('idx_embeddings_vec', vector32(?), ?) AS v
103103+ JOIN document_embeddings e ON e.rowid = v.id
104104+ JOIN documents d ON d.id = e.document_id
105105+ WHERE d.deleted_at IS NULL;
106106+ ```
107107+108108+- [ ] Normalize distance to relevance score: `score = 1.0 - (distance / 2.0)`
109109+- [ ] Apply same filters as keyword search (collection, author, repo, type)
110110+- [ ] Add timeout and cost controls (limit vector search to reasonable K)
111111+- [ ] Wire `/search/semantic` handler
112112+- [ ] Return `matched_by: ["semantic"]` in results
113113+114114+### Verification
115115+116116+- [ ] Semantically similar queries retrieve expected documents even with little lexical overlap
117117+- [ ] Documents without embeddings are omitted from semantic results
118118+- [ ] Semantic search returns the same JSON schema as keyword search
119119+- [ ] Latency is acceptable under small test load
120120+- [ ] Filters work correctly with semantic results
121121+122122+### Exit Criteria
123123+124124+The API supports true semantic search over Tangled documents.
+53
packages/api/docs/tasks/phase-3-hybrid.md
···11+---
22+title: "Phase 3 — Hybrid Search"
33+updated: 2026-03-22
44+---
55+66+# Phase 3 — Hybrid Search
77+88+Merge lexical and semantic search into the default high-quality retrieval mode.
99+1010+---
1111+1212+## M10 — Hybrid Search
1313+1414+refs: [specs/05-search.md](../specs/05-search.md)
1515+1616+### Deliverables
1717+1818+- `GET /search/hybrid` endpoint
1919+- Weighted score blending (keyword 0.65 + semantic 0.35)
2020+- Score normalization
2121+- Result deduplication
2222+- `matched_by` metadata showing which modes contributed
2323+2424+### Tasks
2525+2626+- [ ] Implement hybrid search orchestrator:
2727+ 1. Fetch top N keyword results (N=50 or configurable)
2828+ 2. Fetch top N semantic results
2929+ 3. Normalize keyword scores (min-max within result set)
3030+ 4. Semantic scores already normalized (0–1)
3131+ 5. Merge on `document_id`
3232+ 6. For documents in both sets: `hybrid_score = 0.65 * keyword + 0.35 * semantic`
3333+ 7. For documents in one set: use available score (other = 0)
3434+ 8. Sort by hybrid_score descending
3535+ 9. Deduplicate
3636+ 10. Apply limit/offset
3737+- [ ] Populate `matched_by` field: `["keyword"]`, `["semantic"]`, or `["keyword", "semantic"]`
3838+- [ ] Make weights configurable via `HYBRID_KEYWORD_WEIGHT` / `HYBRID_SEMANTIC_WEIGHT`
3939+- [ ] Wire `/search/hybrid` handler
4040+- [ ] Make `/search?mode=hybrid` work
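The merge steps reduce to a pure function over pre-normalized score maps (names here are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

type scored struct {
	ID    string
	Score float64
}

// mergeHybrid blends normalized keyword and semantic scores; a document
// present in only one set contributes 0 for the missing mode, and merging
// on ID deduplicates documents found by both.
func mergeHybrid(keyword, semantic map[string]float64, wKw, wSem float64) []scored {
	ids := map[string]bool{}
	for id := range keyword {
		ids[id] = true
	}
	for id := range semantic {
		ids[id] = true
	}
	var out []scored
	for id := range ids {
		out = append(out, scored{id, wKw*keyword[id] + wSem*semantic[id]})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	kw := map[string]float64{"doc1": 1.0, "doc2": 0.4}
	sem := map[string]float64{"doc1": 0.8, "doc3": 0.9}
	for _, r := range mergeHybrid(kw, sem, 0.65, 0.35) {
		fmt.Printf("%s %.3f\n", r.ID, r.Score)
	}
	// doc1: 0.65*1.0 + 0.35*0.8 = 0.930; doc3: 0.315; doc2: 0.260
}
```

The `matched_by` field falls out of the same merge: a document's modes are whichever input maps contained its ID.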
4141+4242+### Verification
4343+4444+- [ ] Hybrid returns documents found by either source
4545+- [ ] Duplicates are merged correctly (no duplicate IDs in results)
4646+- [ ] Exact-match queries still favor lexical relevance
4747+- [ ] Exploratory natural-language queries improve over keyword-only results
4848+- [ ] Score ordering is stable across repeated runs on the same corpus
4949+- [ ] `matched_by` accurately reflects which modes produced each result
5050+5151+### Exit Criteria
5252+5353+Hybrid search becomes the preferred default search mode.
+49
packages/api/docs/tasks/phase-4-quality.md
···11+---
22+title: "Phase 4 — Ranking and Quality Polish"
33+updated: 2026-03-22
44+---
55+66+# Phase 4 — Ranking and Quality Polish
77+88+Improve search quality without changing the core architecture.
99+1010+---
1111+1212+## M11 — Ranking and Quality Polish
1313+1414+refs: [specs/05-search.md](../specs/05-search.md)
1515+1616+### Deliverables
1717+1818+- Boosted field weighting refinement
1919+- Recency boost
2020+- Collection-aware ranking
2121+- Better snippets/highlights
2222+- Issue/PR state filtering
2323+- Star count as ranking signal
2424+- Optional query analytics
2525+2626+### Tasks
2727+2828+- [ ] Tune FTS index weights based on real query results
2929+- [ ] Add small recency boost to ranking (e.g., decay function on `created_at`)
3030+- [ ] Add collection-aware ranking adjustments (repos ranked differently from comments)
3131+- [ ] Index `sh.tangled.repo.issue.comment` and `sh.tangled.repo.pull.comment` (P2 collections)
3232+- [ ] Aggregate `sh.tangled.feed.star` counts per repo and use as ranking signal
3333+- [ ] Implement `state` filter (open/closed/merged) using `record_state` table
3434+- [ ] Improve snippets: better truncation, multi-field highlights
3535+- [ ] Add curated relevance test fixtures (expected queries → expected top results)
3636+- [ ] Run `OPTIMIZE INDEX idx_documents_fts` as maintenance task
3737+- [ ] Optional: log queries for analytics (anonymized)
3838+3939+### Verification
4040+4141+- [ ] Exact repo lookups reliably rank the repo first
4242+- [ ] Recent, active content gets a modest boost without overwhelming exact-match relevance
4343+- [ ] Snippets show useful matched context
4444+- [ ] Ranking regression tests catch obvious degradations
4545+- [ ] State filter correctly excludes closed/merged items when requested
4646+4747+### Exit Criteria
4848+4949+Search quality is noticeably improved and more predictable.
···11+CREATE TABLE IF NOT EXISTS documents (
22+ id TEXT PRIMARY KEY,
33+ did TEXT NOT NULL,
44+ collection TEXT NOT NULL,
55+ rkey TEXT NOT NULL,
66+ at_uri TEXT NOT NULL,
77+ cid TEXT NOT NULL,
88+ record_type TEXT NOT NULL,
99+ title TEXT,
1010+ body TEXT,
1111+ summary TEXT,
1212+ repo_did TEXT,
1313+ repo_name TEXT,
1414+ author_handle TEXT,
1515+ tags_json TEXT,
1616+ language TEXT,
1717+ created_at TEXT,
1818+ updated_at TEXT,
1919+ indexed_at TEXT NOT NULL,
2020+ deleted_at TEXT
2121+);
2222+2323+CREATE INDEX IF NOT EXISTS idx_documents_did ON documents(did);
2424+2525+CREATE INDEX IF NOT EXISTS idx_documents_collection ON documents(collection);
2626+2727+CREATE INDEX IF NOT EXISTS idx_documents_record_type ON documents(record_type);
2828+2929+CREATE INDEX IF NOT EXISTS idx_documents_repo_did ON documents(repo_did);
3030+3131+CREATE INDEX IF NOT EXISTS idx_documents_created_at ON documents(created_at);
3232+3333+CREATE INDEX IF NOT EXISTS idx_documents_deleted_at ON documents(deleted_at);
3434+3535+CREATE INDEX IF NOT EXISTS idx_documents_fts ON documents USING fts (
3636+ title WITH tokenizer=default,
3737+ body WITH tokenizer=default,
3838+ summary WITH tokenizer=default,
3939+ repo_name WITH tokenizer=simple,
4040+ author_handle WITH tokenizer=raw,
4141+ tags_json WITH tokenizer=simple
4242+) WITH (weights='title=3.0,repo_name=2.5,author_handle=2.0,summary=1.5,tags_json=1.2,body=1.0');
4343+4444+CREATE TABLE IF NOT EXISTS sync_state (
4545+ consumer_name TEXT PRIMARY KEY,
4646+ cursor TEXT NOT NULL,
4747+ high_water_mark TEXT,
4848+ updated_at TEXT NOT NULL
4949+);
5050+5151+CREATE TABLE IF NOT EXISTS document_embeddings (
5252+ document_id TEXT PRIMARY KEY REFERENCES documents(id),
5353+ embedding F32_BLOB(768),
5454+ embedding_model TEXT NOT NULL,
5555+ embedded_at TEXT NOT NULL
5656+);
5757+5858+CREATE INDEX IF NOT EXISTS idx_embeddings_vec ON document_embeddings(
5959+ libsql_vector_idx(embedding, 'metric=cosine')
6060+);
6161+6262+CREATE TABLE IF NOT EXISTS embedding_jobs (
6363+ document_id TEXT PRIMARY KEY REFERENCES documents(id),
6464+ status TEXT NOT NULL,
6565+ attempts INTEGER NOT NULL DEFAULT 0,
6666+ last_error TEXT,
6767+ scheduled_at TEXT NOT NULL,
6868+ updated_at TEXT NOT NULL
6969+);
7070+7171+CREATE INDEX IF NOT EXISTS idx_embedding_jobs_status ON embedding_jobs(status);
7272+7373+CREATE TABLE IF NOT EXISTS record_state (
7474+ subject_uri TEXT PRIMARY KEY,
7575+ state TEXT NOT NULL,
7676+ updated_at TEXT NOT NULL
7777+);
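For reference, `F32_BLOB(768)` columns hold raw 32-bit floats, so a Go caller can bind an embedding as a little-endian byte blob. This helper is a sketch under that assumption (libSQL also accepts the `vector32('[...]')` text form; check your driver's docs for which it expects):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// EncodeF32 packs a float32 slice into a raw little-endian blob,
// the layout assumed here for binding to an F32_BLOB column.
func EncodeF32(vec []float32) []byte {
	buf := make([]byte, 4*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(v))
	}
	return buf
}

// DecodeF32 reverses EncodeF32, for reading embeddings back out.
func DecodeF32(blob []byte) []float32 {
	out := make([]float32, len(blob)/4)
	for i := range out {
		out[i] = math.Float32frombits(binary.LittleEndian.Uint32(blob[4*i:]))
	}
	return out
}

func main() {
	v := []float32{0.1, -0.5, 2}
	blob := EncodeF32(v)
	fmt.Println(len(blob), DecodeF32(blob))
}
```

A 768-dimension embedding therefore occupies 3,072 bytes per row in `document_embeddings`, which is worth keeping in mind when sizing the Turso database.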