a love letter to tangled (android, iOS, and a search API)

chore: drop semantic search (for now!)

+24 -219
+8 -26
docs/roadmap.md
··· 5 5 6 6 ## API: Search Stabilization 7 7 8 - Highest priority. This work blocks further investment in semantic search, hybrid ranking, and broader discovery features. 8 + Highest priority. This work blocks further investment in search quality and broader discovery features. 9 9 10 10 - [x] Stabilize local development and experimentation around a local `file:` database 11 11 - [x] Document backup, restore, and disk-growth procedures for the experimental local DB ··· 33 33 34 34 Completed on [2026-03-25](../CHANGELOG.md#2026-03-25) 35 35 36 - ## API: Semantic Search Pipeline 37 - 38 - Nomic Embed Text v1.5 via Railway template, async embedding pipeline. 39 - 40 - **Blocked on:** API: Search Stabilization 41 - 42 - - [ ] Deploy nomic-embed Railway template (`POST /api/embeddings` with Bearer auth) 43 - - [ ] Embedding client in Go API (`internal/embedding/`) calling the Nomic service 44 - - [ ] Embed-worker: consume `embedding_jobs` queue, generate 768-dim vectors, store in `document_embeddings` 45 - - [ ] `GET /search/semantic` endpoint using DiskANN vector_top_k 46 - - [ ] Reembed command for bulk re-generation 47 - 48 - ## API: Hybrid Search 49 - 50 - Combine keyword and semantic results. 36 + ## API: FTS5 Search Quality 51 37 52 - **Blocked on:** API: Search Stabilization, API: Semantic Search Pipeline 38 + Improve keyword search quality without external dependencies. 53 39 54 - - [ ] Score normalization (keyword BM25 → [0,1], semantic cosine → [0,1]) 55 - - [ ] Weighted merge (0.65 keyword + 0.35 semantic, configurable) 56 - - [ ] Deduplication by document ID 57 - - [ ] `matched_by` metadata in results 40 + **Depends on:** API: Search Stabilization 58 41 59 - ## API: Search Quality 60 - 61 - **Blocked on:** API: Search Stabilization 62 - 63 - - [ ] Field weight tuning based on real queries 42 + - [ ] Synonym expansion at query time (e.g. "repo" matches "repository") 43 + - [ ] Stemming tokenizer (porter or unicode61+porter) 44 + - [ ] Prefix search support for autocomplete 45 + - [ ] Field weight tuning based on real query patterns 64 46 - [ ] Recency boost for recently updated content 65 47 - [ ] Star count ranking signal (via Constellation) 66 48 - [ ] State filtering defaults (exclude closed issues)
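The synonym-expansion task on that roadmap could be done entirely at query time, before the query string reaches FTS5. A minimal sketch of the idea; `synonyms` and `expandQuery` are hypothetical names, not code from the repo, and terms are assumed not to contain embedded double quotes:

```go
package main

import (
	"fmt"
	"strings"
)

// synonyms is a hypothetical static table; a real implementation might load
// it from config or a database. Keys and values are lowercase.
var synonyms = map[string][]string{
	"repo":  {"repository"},
	"issue": {"ticket"},
}

// expandQuery rewrites each whitespace-separated term into an FTS5 OR-group
// when a synonym exists, so `repo` becomes `("repo" OR "repository")`.
func expandQuery(q string) string {
	terms := strings.Fields(q)
	out := make([]string, 0, len(terms))
	for _, t := range terms {
		group := []string{fmt.Sprintf("%q", t)}
		for _, alt := range synonyms[strings.ToLower(t)] {
			group = append(group, fmt.Sprintf("%q", alt))
		}
		if len(group) == 1 {
			out = append(out, group[0])
		} else {
			out = append(out, "("+strings.Join(group, " OR ")+")")
		}
	}
	return strings.Join(out, " ")
}

func main() {
	fmt.Println(expandQuery("repo search")) // → ("repo" OR "repository") "search"
}
```

Doing this in the query builder rather than the index keeps the FTS5 table untouched, which matches the "no external dependencies" framing above.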
+8 -5
docs/specs/search.md
··· 9 9 Search now has two phases: 10 10 11 11 1. Stabilize indexing and activity caching so search is cheap and reliable. 12 - 2. Resume semantic and hybrid work only after the base pipeline is stable. 12 + 2. Enhance keyword search quality with FTS5 features once the base pipeline is stable. 13 13 14 14 ## Immediate Priority 15 15 ··· 22 22 23 23 Production storage is Turso cloud. The reasoning is recorded in `docs/adr/storage.md`, with the comparison inputs in `docs/adr/pg.md` and `docs/adr/turso.md`. 24 24 25 - These tasks block further work on semantic and hybrid search. 25 + These tasks block further work on search quality improvements. 26 26 27 27 ## Planning Decisions 28 28 ··· 155 155 2. Direct API reads enqueue background indexing for misses. 156 156 3. JetStream fills only the recent-activity cache. 157 157 4. Smoke tests guard the critical paths. 158 - 5. Semantic and hybrid search remain blocked until the base pipeline is stable. 158 + 5. FTS5 quality improvements (synonyms, stemming, prefix search) follow once the base pipeline is stable. 159 159 160 160 ## Backfill Strategy 161 161 ··· 172 172 | Param | Required | Default | Description | 173 173 | ------------ | -------- | ------- | ------------------------------------- | 174 174 | `q` | Yes | — | Query string | 175 - | `mode` | No | keyword | keyword, semantic, or hybrid | 175 + | `mode` | No | keyword | keyword | 176 176 | `limit` | No | 20 | Results per page (1–100) | 177 177 | `offset` | No | 0 | Pagination offset | 178 178 | `collection` | No | — | Filter by collection NSID | ··· 227 227 228 228 4. **JetStream is for recent activity, not authoritative indexing.** Use it to power the cached feed, not to replace Tap or repo re-sync. 229 229 230 - 5. **Semantic search is additive.** It improves discovery for vague queries but is not required for the app to be useful. 230 + 5. **FTS5 enhancements are the next quality step.** Synonym expansion, stemming, and prefix search improve discovery without external dependencies. 231 231 232 232 6. **Graceful degradation.** The mobile app treats the search API as optional. If Twister is unavailable, handle-based direct browsing still works. Search results link into the same browsing screens. 233 233 234 234 ## Quality Improvements (Planned) 235 235 236 + - Synonym expansion at query time (e.g. "repo" matches "repository") 237 + - Stemming tokenizer (porter or unicode61+porter) 238 + - Prefix search support for autocomplete 236 239 - Field weight tuning based on real query patterns 237 240 - Recency boost for recently updated content 238 241 - Collection-aware ranking
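Of the planned quality items in that spec, prefix search is the one FTS5 supports directly in query syntax: a trailing `*` after a quoted string (`"stabi"*`) matches any token with that prefix. A sketch of how the autocomplete path might build such a query; `prefixQuery` is a hypothetical helper, not code from the repo:

```go
package main

import (
	"fmt"
	"strings"
)

// prefixQuery rewrites the last term of a partial query into an FTS5 prefix
// match, so typing "stabi" matches "stabilize", "stabilization", and so on.
// Every term is double-quoted so FTS5 operators in user input are treated as
// plain text; terms are assumed not to contain embedded double quotes.
func prefixQuery(q string) string {
	terms := strings.Fields(q)
	if len(terms) == 0 {
		return ""
	}
	for i, t := range terms {
		terms[i] = fmt.Sprintf("%q", t)
	}
	terms[len(terms)-1] += "*" // FTS5 reads "stabi"* as a prefix query
	return strings.Join(terms, " ")
}

func main() {
	fmt.Println(prefixQuery("search stabi")) // → "search" "stabi"*
}
```

The stemming item, by contrast, is a schema-level change rather than a query-builder one: FTS5 selects stemming through the table's `tokenize` option (for example `tokenize = 'porter unicode61'`), so it would land in a migration and require a reindex.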
-10
packages/api/.env.example
··· 10 10 SEARCH_MAX_LIMIT=100 11 11 SEARCH_DEFAULT_MODE=keyword 12 12 13 - HYBRID_KEYWORD_WEIGHT=0.65 14 - HYBRID_SEMANTIC_WEIGHT=0.35 15 - 16 - # EMBEDDING_PROVIDER=openai 17 - # EMBEDDING_MODEL=text-embedding-3-small 18 - # EMBEDDING_API_KEY=sk-... 19 - # EMBEDDING_API_URL= 20 - # EMBEDDING_DIM=768 21 - # EMBEDDING_BATCH_SIZE=32 22 - 23 13 HTTP_BIND_ADDR=:8080 24 14 LOG_LEVEL=info 25 15 LOG_FORMAT=json
+1 -9
packages/api/internal/api/api.go
··· 58 58 mux.HandleFunc("GET /oauth/client-metadata.json", s.handleOAuthClientMetadata) 59 59 mux.HandleFunc("GET /search", s.handleSearch) 60 60 mux.HandleFunc("GET /search/keyword", s.handleSearchKeyword) 61 - mux.HandleFunc("GET /search/semantic", s.handleNotImplemented) 62 - mux.HandleFunc("GET /search/hybrid", s.handleNotImplemented) 63 61 64 62 mux.HandleFunc("GET /documents/{id}", s.handleGetDocument) 65 63 mux.HandleFunc("GET /profiles/{did}/summary", s.handleProfileSummary) ··· 98 96 99 97 if s.cfg.EnableAdminEndpoints { 100 98 mux.HandleFunc("POST /admin/reindex", s.handleAdminReindex) 101 - mux.HandleFunc("POST /admin/reembed", s.handleNotImplemented) 102 99 } 103 100 104 101 site := view.Handler() ··· 224 221 switch mode { 225 222 case "keyword": 226 223 s.handleSearchKeyword(w, r) 227 - case "semantic", "hybrid": 228 - s.handleNotImplemented(w, r) 229 224 default: 230 - writeJSON(w, http.StatusBadRequest, errorBody("invalid_parameter", "mode must be keyword, semantic, or hybrid")) 225 + writeJSON(w, http.StatusBadRequest, errorBody("invalid_parameter", "mode must be keyword")) 231 226 } 232 227 } 233 228 ··· 483 478 }) 484 479 } 485 480 486 - func (s *Server) handleNotImplemented(w http.ResponseWriter, _ *http.Request) { 487 - writeJSON(w, http.StatusNotImplemented, errorBody("not_implemented", "this endpoint is not yet available")) 488 - }
-8
packages/api/internal/api/readthrough.go
··· 179 179 return fmt.Errorf("upsert document: %w", err) 180 180 } 181 181 182 - if adapter.Searchable(record) { 183 - if err := s.store.EnqueueEmbeddingJob(ctx, doc.ID); err != nil { 184 - s.log.Warn("read-through enqueue embedding failed", 185 - slog.String("document_id", doc.ID), 186 - slog.String("error", err.Error()), 187 - ) 188 - } 189 - } 190 182 return nil 191 183 } 192 184
-28
packages/api/internal/config/config.go
··· 20 20 SearchDefaultLimit int 21 21 SearchMaxLimit int 22 22 SearchDefaultMode string 23 - EmbeddingProvider string 24 - EmbeddingModel string 25 - EmbeddingAPIKey string 26 - EmbeddingAPIURL string 27 - EmbeddingDim int 28 - EmbeddingBatchSize int 29 - HybridKeywordWeight float64 30 - HybridSemanticWeight float64 31 23 HTTPBindAddr string 32 24 IndexerHealthAddr string 33 25 LogLevel string ··· 65 57 TapAuthPassword: os.Getenv("TAP_AUTH_PASSWORD"), 66 58 IndexedCollections: os.Getenv("INDEXED_COLLECTIONS"), 67 59 SearchDefaultMode: envOrDefault("SEARCH_DEFAULT_MODE", "keyword"), 68 - EmbeddingProvider: os.Getenv("EMBEDDING_PROVIDER"), 69 - EmbeddingModel: os.Getenv("EMBEDDING_MODEL"), 70 - EmbeddingAPIKey: os.Getenv("EMBEDDING_API_KEY"), 71 - EmbeddingAPIURL: os.Getenv("EMBEDDING_API_URL"), 72 60 HTTPBindAddr: envOrDefault("HTTP_BIND_ADDR", ":8080"), 73 61 IndexerHealthAddr: envOrDefault("INDEXER_HEALTH_ADDR", ":9090"), 74 62 LogLevel: envOrDefault("LOG_LEVEL", "info"), ··· 76 64 AdminAuthToken: os.Getenv("ADMIN_AUTH_TOKEN"), 77 65 SearchDefaultLimit: envInt("SEARCH_DEFAULT_LIMIT", 20), 78 66 SearchMaxLimit: envInt("SEARCH_MAX_LIMIT", 100), 79 - EmbeddingDim: envInt("EMBEDDING_DIM", 768), 80 - EmbeddingBatchSize: envInt("EMBEDDING_BATCH_SIZE", 32), 81 - HybridKeywordWeight: envFloat("HYBRID_KEYWORD_WEIGHT", 0.65), 82 - HybridSemanticWeight: envFloat("HYBRID_SEMANTIC_WEIGHT", 0.35), 83 67 EnableAdminEndpoints: envBool("ENABLE_ADMIN_ENDPOINTS", false), 84 68 EnableIngestEnrichment: envBool("ENABLE_INGEST_ENRICHMENT", true), 85 69 PLCDirectoryURL: envOrDefault("PLC_DIRECTORY_URL", "https://plc.directory"), ··· 175 159 return def 176 160 } 177 161 return n 178 - } 179 - 180 - func envFloat(key string, def float64) float64 { 181 - v := os.Getenv(key) 182 - if v == "" { 183 - return def 184 - } 185 - f, err := strconv.ParseFloat(v, 64) 186 - if err != nil { 187 - return def 188 - } 189 - return f 190 162 } 191 163 192 164 func envDuration(key string, def time.Duration) time.Duration {
-10
packages/api/internal/ingest/ingest.go
··· 269 269 return err 270 270 } 271 271 272 - if adapter.Searchable(record.Record) { 273 - if err := r.store.EnqueueEmbeddingJob(ctx, doc.ID); err != nil { 274 - r.log.Warn("embedding enqueue failed", 275 - slog.Int64("event_id", event.ID), 276 - slog.String("document_id", doc.ID), 277 - slog.String("error", err.Error()), 278 - ) 279 - } 280 - } 281 - 282 272 return r.advanceCursorAndAck(ctx, event.ID) 283 273 } 284 274
+2 -13
packages/api/internal/ingest/ingest_test.go
··· 36 36 initialSync *store.SyncState 37 37 recordStates map[string]string 38 38 handles map[string]string 39 - enqueued map[string]bool 40 - onSetSync func() 39 + onSetSync func() 41 40 } 42 41 43 42 func newFakeStore() *fakeStore { ··· 45 44 docs: make(map[string]*store.Document), 46 45 deleted: make(map[string]bool), 47 46 recordStates: make(map[string]string), 48 - handles: make(map[string]string), 49 - enqueued: make(map[string]bool), 47 + handles: make(map[string]string), 50 48 } 51 49 } 52 50 ··· 96 94 97 95 func (f *fakeStore) GetIdentityHandle(_ context.Context, did string) (string, error) { 98 96 return f.handles[did], nil 99 - } 100 - 101 - func (f *fakeStore) EnqueueEmbeddingJob(_ context.Context, documentID string) error { 102 - f.enqueued[documentID] = true 103 - return nil 104 97 } 105 98 106 99 func (f *fakeStore) EnqueueIndexingJob(_ context.Context, _ store.IndexingJobInput) error { ··· 227 220 if doc.AuthorHandle != "author.tangled.org" { 228 221 t.Fatalf("author handle: got %q", doc.AuthorHandle) 229 222 } 230 - if !st.enqueued[docID] { 231 - t.Fatalf("embedding job not enqueued for %q", docID) 232 - } 233 - 234 223 deleteEvent := normalize.TapRecordEvent{ 235 224 ID: 202, 236 225 Type: "record",
+3 -9
packages/api/internal/store/db.go
··· 16 16 //go:embed migrations/*.sql 17 17 var migrationsFS embed.FS 18 18 19 - var extensionMigrationNoticeLogged bool 20 - 21 19 type migrationMode struct { 22 20 allowTursoExtensionSkip bool 23 21 targetDescription string ··· 189 187 if _, err := db.Exec(stmt); err != nil { 190 188 upper := strings.ToUpper(stmt) 191 189 if strings.Contains(upper, "LIBSQL_VECTOR_IDX") { 192 - if !extensionMigrationNoticeLogged { 193 - extensionMigrationNoticeLogged = true 194 - slog.Info("migration: skipping unsupported extension index", 195 - "migration", name, 196 - "reason", "database engine does not support vector index DDL in this environment", 197 - ) 198 - } 190 + slog.Debug("migration: skipping unsupported vector index DDL", 191 + "migration", name, 192 + ) 199 193 continue 200 194 } 201 195 if strings.Contains(upper, "CREATE VIRTUAL TABLE") && strings.Contains(upper, "USING FTS5") {
+2
packages/api/internal/store/migrations/007_drop_embeddings.sql
··· 1 + DROP TABLE IF EXISTS document_embeddings; 2 + DROP TABLE IF EXISTS embedding_jobs;
-18
packages/api/internal/store/sql_store.go
··· 253 253 return handle.String, nil 254 254 } 255 255 256 - func (s *SQLStore) EnqueueEmbeddingJob(ctx context.Context, documentID string) error { 257 - now := time.Now().UTC().Format(time.RFC3339) 258 - _, err := s.db.ExecContext(ctx, ` 259 - INSERT INTO embedding_jobs (document_id, status, attempts, last_error, scheduled_at, updated_at) 260 - VALUES (?, 'pending', 0, NULL, ?, ?) 261 - ON CONFLICT(document_id) DO UPDATE SET 262 - status = 'pending', 263 - last_error = NULL, 264 - scheduled_at = excluded.scheduled_at, 265 - updated_at = excluded.updated_at`, 266 - documentID, now, now, 267 - ) 268 - if err != nil { 269 - return fmt.Errorf("enqueue embedding job: %w", err) 270 - } 271 - return nil 272 - } 273 - 274 256 func (s *SQLStore) EnqueueIndexingJob(ctx context.Context, input IndexingJobInput) error { 275 257 now := time.Now().UTC().Format(time.RFC3339) 276 258 _, err := s.db.ExecContext(ctx, `
-1
packages/api/internal/store/store.go
··· 110 110 UpdateRecordState(ctx context.Context, subjectURI string, state string) error 111 111 UpsertIdentityHandle(ctx context.Context, did, handle string, isActive bool, status string) error 112 112 GetIdentityHandle(ctx context.Context, did string) (string, error) 113 - EnqueueEmbeddingJob(ctx context.Context, documentID string) error 114 113 EnqueueIndexingJob(ctx context.Context, input IndexingJobInput) error 115 114 ClaimIndexingJob(ctx context.Context) (*IndexingJob, error) 116 115 CompleteIndexingJob(ctx context.Context, documentID string) error
-44
packages/api/internal/store/store_test.go
··· 2 2 3 3 import ( 4 4 "context" 5 - "database/sql" 6 5 "os" 7 6 "path/filepath" 8 7 "testing" ··· 191 190 } 192 191 if handle != "alice2.tangled.org" { 193 192 t.Fatalf("handle after update: got %q, want %q", handle, "alice2.tangled.org") 194 - } 195 - }) 196 - 197 - t.Run("enqueue embedding job is idempotent", func(t *testing.T) { 198 - doc := &store.Document{ 199 - ID: "did:plc:embed|sh.tangled.string|abc", 200 - DID: "did:plc:embed", 201 - Collection: "sh.tangled.string", 202 - RKey: "abc", 203 - ATURI: "at://did:plc:embed/sh.tangled.string/abc", 204 - CID: "bafyreienqueue", 205 - RecordType: "string", 206 - Title: "foo.go", 207 - Body: "package main", 208 - } 209 - if err := st.UpsertDocument(ctx, doc); err != nil { 210 - t.Fatalf("upsert doc for embedding queue: %v", err) 211 - } 212 - 213 - if err := st.EnqueueEmbeddingJob(ctx, doc.ID); err != nil { 214 - t.Fatalf("enqueue embedding job: %v", err) 215 - } 216 - if err := st.EnqueueEmbeddingJob(ctx, doc.ID); err != nil { 217 - t.Fatalf("enqueue embedding job second call: %v", err) 218 - } 219 - 220 - row := db.QueryRowContext(ctx, `SELECT status, attempts, last_error FROM embedding_jobs WHERE document_id = ?`, doc.ID) 221 - var ( 222 - status string 223 - attempts int 224 - lastError sql.NullString 225 - ) 226 - if err := row.Scan(&status, &attempts, &lastError); err != nil { 227 - t.Fatalf("query embedding job: %v", err) 228 - } 229 - if status != "pending" { 230 - t.Fatalf("status: got %q, want pending", status) 231 - } 232 - if attempts != 0 { 233 - t.Fatalf("attempts: got %d, want 0", attempts) 234 - } 235 - if lastError.Valid { 236 - t.Fatalf("last_error: got %q, want NULL", lastError.String) 237 193 } 238 194 }) 239 195
-38
packages/api/main.go
··· 48 48 newAPICmd(&local), 49 49 newIndexerCmd(&local), 50 50 newBackfillCmd(&local), 51 - newEmbedWorkerCmd(&local), 52 51 newReindexCmd(&local), 53 - newReembedCmd(&local), 54 52 newEnrichCmd(&local), 55 53 newHealthcheckCmd(&local), 56 54 ) ··· 209 207 } 210 208 } 211 209 212 - func newEmbedWorkerCmd(local *bool) *cobra.Command { 213 - return &cobra.Command{ 214 - Use: "embed-worker", 215 - Short: "Start the async embedding worker", 216 - RunE: func(cmd *cobra.Command, args []string) error { 217 - cfg, err := config.Load(config.LoadOptions{Local: *local}) 218 - if err != nil { 219 - return fmt.Errorf("config: %w", err) 220 - } 221 - log := observability.NewLogger(cfg) 222 - log.Info("starting embed-worker", slog.String("service", "embed-worker"), slog.String("version", version)) 223 - ctx, cancel := baseContext() 224 - defer cancel() 225 - <-ctx.Done() 226 - log.Info("shutting down embed-worker") 227 - return nil 228 - }, 229 - } 230 - } 231 - 232 210 func newBackfillCmd(local *bool) *cobra.Command { 233 211 var opts backfill.Options 234 212 ··· 344 322 cmd.Flags().BoolVar(&opts.DryRun, "dry-run", false, "Show intended work without writing") 345 323 346 324 return cmd 347 - } 348 - 349 - func newReembedCmd(local *bool) *cobra.Command { 350 - return &cobra.Command{ 351 - Use: "reembed", 352 - Short: "Re-generate all embeddings", 353 - RunE: func(cmd *cobra.Command, args []string) error { 354 - cfg, err := config.Load(config.LoadOptions{Local: *local}) 355 - if err != nil { 356 - return fmt.Errorf("config: %w", err) 357 - } 358 - log := observability.NewLogger(cfg) 359 - log.Info("reembed: not yet implemented") 360 - return nil 361 - }, 362 - } 363 325 } 364 326 365 327 func newEnrichCmd(local *bool) *cobra.Command {