Divepool Embedding Firehose Client (experimental)#

Example client for the Divepool embedding firehose and search API — a real-time stream of 128d EmbeddingGemma embeddings (Matryoshka truncation, L2-normalized) for Bluesky posts and profiles, plus a semantic search endpoint. Both the API and this client are highly experimental and subject to change.

Stream client#

Real-time embedding stream. Public clients receive 128d vectors; token-authenticated clients receive full 768d.

go run . stream                                 # public endpoint (128d)
go run . stream -token x                        # with bearer token (768d)
go run . stream -token x -out events.jsonl      # write events to file

Without a subcommand, the client defaults to stream for backwards compatibility.

Search client#

Semantic search across all indexed Bluesky posts. Bearer token optional — non-bearer gets 128d embeddings, bearer gets 768d.

# Global semantic search
go run . search "machine learning"                                     # public search (top 20)
go run . search -token x "machine learning"                            # bearer search (768d embeddings)
go run . search -limit 100 "machine learning"                          # more results (max 400 global)
go run . search -distinct=false "machine learning"                     # multiple posts per account
go run . search -cluster "machine learning"                            # with topic clusters
go run . search -embeddings "machine learning"                         # include per-result embeddings

# Account-scoped (max 1200)
go run . search -did did:plc:abc123                                    # browse account (newest, clustered)
go run . search -did did:plc:abc123 "machine learning"                 # search within account
go run . search -did did:plc:abc123 -cluster "machine learning"        # search within account + cluster

# Search by example post
go run . search -did did:plc:abc123 -rkey 3abc                         # find similar posts globally
go run . search -did did:plc:abc123 -rkey 3abc -cluster                # find similar + cluster

Search API#

POST /api/v1/search — bearer token optional.

Request:

{
  "query": "machine learning",
  "did": "did:plc:abc123",
  "rkey": "3abc",
  "limit": 100,
  "distinct": true,
  "cluster": true,
  "include_embeddings": true
}

Field	Type	Default	Description
`query`	string	—	Search query text (exclusive with `rkey`)
`did`	string	—	Scope search to a specific account (AT Protocol DID)
`rkey`	string	—	Use this post's embedding as query (requires `did`, exclusive with `query`)
`limit`	int	400/1200	Max results. Global: max 400. DID-scoped: max 1200
`distinct`	bool	true	One result per account
`cluster`	bool	false	Enable UMAP+HDBSCAN clustering with c-TF-IDF topic extraction
`include_embeddings`	bool	false	Include per-result cluster embeddings (128d non-bearer, 768d bearer)
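As a rough illustration of the request fields above, the following Go sketch builds a search request. Only the JSON keys and the `/api/v1/search` path come from this README. The Go type and function names, the `https://divepool.example` base URL, the `Authorization: Bearer` header scheme, and the idea that omitted fields fall back to server defaults are assumptions made for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// SearchRequest mirrors the documented request fields. The Go names are
// illustrative; only the JSON keys are taken from the table above.
type SearchRequest struct {
	Query             string `json:"query,omitempty"`
	DID               string `json:"did,omitempty"`
	RKey              string `json:"rkey,omitempty"`
	Limit             int    `json:"limit,omitempty"`
	Distinct          *bool  `json:"distinct,omitempty"` // nil leaves the field out (assumed to mean "use default")
	Cluster           bool   `json:"cluster,omitempty"`
	IncludeEmbeddings bool   `json:"include_embeddings,omitempty"`
}

// encodeRequest serializes the request body, dropping unset fields.
func encodeRequest(r SearchRequest) ([]byte, error) {
	return json.Marshal(r)
}

// newSearchRequest prepares (but does not send) the POST request.
// baseURL is hypothetical; the bearer header scheme is an assumption.
func newSearchRequest(baseURL, token string, r SearchRequest) (*http.Request, error) {
	body, err := encodeRequest(r)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/search", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	distinct := false
	r := SearchRequest{Query: "machine learning", Limit: 100, Distinct: &distinct}

	body, err := encodeRequest(r)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))

	req, err := newSearchRequest("https://divepool.example", "x", r)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```

`Distinct` is a pointer so that an explicit `false` can be told apart from "not set", which matters because the documented default is `true`.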

Search modes: at least one of `query` or `did` is required. `rkey` cannot be used on its own because it requires `did`.

Mode	Fields	Behavior
Global search	`query`	Semantic search across all posts (max 400)
Browse account	`did`	Newest posts, always clustered (max 1200)
Search in account	`did` + `query`	Semantic search within account's posts (max 1200)
Similar posts	`did` + `rkey`	Use post's embedding as query for global search
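The mode table can be read as a small decision procedure over which fields are set. The sketch below is a client-side check under that reading. The function name, the mode strings, and the error messages are illustrative, and the server may validate differently.

```go
package main

import (
	"errors"
	"fmt"
)

// searchMode reports which documented search mode a field combination
// selects, or an error if the combination is not listed in the table.
func searchMode(query, did, rkey string) (string, error) {
	switch {
	case query != "" && rkey != "":
		return "", errors.New("query and rkey are mutually exclusive")
	case rkey != "" && did == "":
		return "", errors.New("rkey requires did")
	case rkey != "":
		return "similar posts", nil
	case did != "" && query != "":
		return "search in account", nil
	case did != "":
		return "browse account", nil
	case query != "":
		return "global search", nil
	default:
		return "", errors.New("at least one of query or did is required")
	}
}

func main() {
	cases := [][3]string{
		{"machine learning", "", ""},
		{"", "did:plc:abc123", ""},
		{"machine learning", "did:plc:abc123", ""},
		{"", "did:plc:abc123", "3abc"},
		{"", "", "3abc"},
	}
	for _, c := range cases {
		mode, err := searchMode(c[0], c[1], c[2])
		if err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println(mode)
	}
}
```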

Response:

{
  "results": [
    {
      "did": "did:plc:abc123",
      "handle": "user.bsky.social",
      "collection": "app.bsky.feed.post",
      "rkey": "3abc",
      "text": "Post text...",
      "score": -0.82,
      "created_at": "2025-01-15T10:30:00Z",
      "detected_lang": "en",
      "cluster_id": 0,
      "topics": ["topic1", "topic2"],
      "embedding": [0.12, -0.34, ...]
    }
  ],
  "clusters": [
    {
      "id": 0,
      "size": 15,
      "topics": ["topic1", "topic2", "topic3"],
      "result_indices": [0, 3, 7, 12]
    }
  ]
}

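`clusters` and the per-result `cluster_id` and `topics` fields are present only when `cluster` is true (or in browse-account mode, which is always clustered). `embedding` is present only when `include_embeddings` is true. `score` is the negative inner product, so lower values mean more similar.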
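To make the response shape and the score convention concrete, here is a minimal Go sketch that decodes a response, picks the most similar result (the lowest score), and resolves each cluster's members. The type and function names are illustrative, the sample JSON is made-up data in the documented shape, and treating `result_indices` as positions in the `results` array is an assumption based on the field name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result holds a subset of the documented per-result fields.
type Result struct {
	DID       string    `json:"did"`
	Handle    string    `json:"handle"`
	RKey      string    `json:"rkey"`
	Text      string    `json:"text"`
	Score     float64   `json:"score"`
	ClusterID *int      `json:"cluster_id"` // nil when the field is absent
	Embedding []float32 `json:"embedding"`  // nil when the field is absent
}

type Cluster struct {
	ID            int      `json:"id"`
	Size          int      `json:"size"`
	Topics        []string `json:"topics"`
	ResultIndices []int    `json:"result_indices"`
}

type SearchResponse struct {
	Results  []Result  `json:"results"`
	Clusters []Cluster `json:"clusters"`
}

func parseResponse(data []byte) (SearchResponse, error) {
	var resp SearchResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

// bestMatch returns the index of the lowest-scoring (most similar)
// result, or -1 if there are no results.
func bestMatch(results []Result) int {
	best := -1
	for i, r := range results {
		if best == -1 || r.Score < results[best].Score {
			best = i
		}
	}
	return best
}

// clusterMembers resolves result_indices, assumed to index into Results.
// Out-of-range indices are skipped rather than trusted.
func clusterMembers(resp SearchResponse, c Cluster) []Result {
	var out []Result
	for _, i := range c.ResultIndices {
		if i >= 0 && i < len(resp.Results) {
			out = append(out, resp.Results[i])
		}
	}
	return out
}

// sample is made-up data in the documented shape, not real API output.
const sample = `{"results":[
 {"did":"did:plc:abc123","rkey":"3abc","text":"first","score":-0.82,"cluster_id":0},
 {"did":"did:plc:def456","rkey":"3def","text":"second","score":-0.91,"cluster_id":0},
 {"did":"did:plc:ghi789","rkey":"3ghi","text":"third","score":-0.40,"cluster_id":1}],
"clusters":[
 {"id":0,"size":2,"topics":["topic1"],"result_indices":[0,1]},
 {"id":1,"size":1,"topics":["topic2"],"result_indices":[2]}]}`

func main() {
	resp, err := parseResponse([]byte(sample))
	if err != nil {
		panic(err)
	}
	if i := bestMatch(resp.Results); i >= 0 {
		fmt.Println("most similar:", resp.Results[i].Text)
	}
	for _, c := range resp.Clusters {
		fmt.Println("cluster", c.ID, c.Topics, "members:", len(clusterMembers(resp, c)))
	}
}
```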

Firehose protocol#

Zstd-compressed NDJSON over HTTP. Each line is a columnar batch:

{"did":["did:plc:abc","did:plc:xyz"],"col":["app.bsky.feed.post","app.bsky.actor.profile"],"rkey":["3abc","self"],"lang":["en","de"],"c":[[128 floats],[128 floats]],"r":[[128 floats],[128 floats]]}

Field	Description
`did`	AT Protocol DID (e.g. `did:plc:abc123`)
`col`	AT Protocol collection NSID (e.g. `app.bsky.feed.post`, `app.bsky.actor.profile`)
`rkey`	Record key within the collection
`lang`	Detected language (`en`, `de`)
`c`	Cluster embedding (prefix `"task: clustering | query: "`)
`r`	Retrieval embedding (prefix `"title: none | text: "`)

Empty batches ([]) are heartbeats sent every 10s.
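Because each line is columnar, a consumer usually wants to turn a batch back into per-record events. The Go sketch below does that for an already-decompressed NDJSON stream. Zstd decoding is left out because it is not in the Go standard library, so the reader passed in is assumed to be wrapped in a zstd decoder. The type and function names are illustrative, the sample vectors are shortened to 2 floats, and treating a literal `[]` line as the heartbeat is one reading of the note above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Batch is one columnar NDJSON line: index i across all slices is one record.
type Batch struct {
	DID  []string    `json:"did"`
	Col  []string    `json:"col"`
	RKey []string    `json:"rkey"`
	Lang []string    `json:"lang"`
	C    [][]float32 `json:"c"`
	R    [][]float32 `json:"r"`
}

// Event is a single record reassembled from the columns.
type Event struct {
	DID, Col, RKey, Lang string
	C, R                 []float32
}

// readEvents scans decompressed NDJSON and calls emit once per record.
func readEvents(r io.Reader, emit func(Event)) error {
	sc := bufio.NewScanner(r)
	// Batch lines can exceed the scanner's default 64 KiB token limit.
	sc.Buffer(make([]byte, 0, 64*1024), 16<<20)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || line == "[]" {
			continue // heartbeat (assumed to be a literal "[]" line)
		}
		var b Batch
		if err := json.Unmarshal([]byte(line), &b); err != nil {
			return err
		}
		n := len(b.DID)
		if len(b.Col) != n || len(b.RKey) != n || len(b.Lang) != n || len(b.C) != n || len(b.R) != n {
			return fmt.Errorf("ragged batch: columns disagree with %d dids", n)
		}
		for i := 0; i < n; i++ {
			emit(Event{b.DID[i], b.Col[i], b.RKey[i], b.Lang[i], b.C[i], b.R[i]})
		}
	}
	return sc.Err()
}

// eventsFromString is a convenience wrapper for trying the decoder on text.
func eventsFromString(s string) ([]Event, error) {
	var out []Event
	err := readEvents(strings.NewReader(s), func(e Event) { out = append(out, e) })
	return out, err
}

// sample uses 2-float vectors for brevity; real batches carry 128 or 768 floats.
const sample = `{"did":["did:plc:abc","did:plc:xyz"],"col":["app.bsky.feed.post","app.bsky.actor.profile"],"rkey":["3abc","self"],"lang":["en","de"],"c":[[0.6,0.8],[1,0]],"r":[[0,1],[0.8,0.6]]}
[]
`

func main() {
	events, err := eventsFromString(sample)
	if err != nil {
		panic(err)
	}
	for _, e := range events {
		fmt.Println(e.DID, e.Col, e.RKey, e.Lang, len(e.C), len(e.R))
	}
}
```

Checking that all columns have the same length before indexing keeps a malformed batch from causing an out-of-range panic.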

Verify embeddings#

Verify events from the stream against local EmbeddingGemma output:

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
HF_TOKEN=hf_... .venv/bin/python3 verify.py events.jsonl

verify.py takes JSONL captured with -out (did, col, rkey, lang, c, r). It fetches each record via the Bluesky API, prepares the text identically to Divepool (posts: text + image/video alt texts + tags; profiles: displayName + description), embeds it locally, and compares the result to the streamed vector via cosine similarity. It requires a Hugging Face token for the gated model.

License#

MIT
