Divepool Embedding Firehose Client (experimental)
Example client for the Divepool embedding firehose and search API. The firehose is a real-time stream of 128d EmbeddingGemma embeddings (Matryoshka-truncated, L2-normalized) for Bluesky posts and profiles; the search API is a semantic search endpoint. Both the API and this client are highly experimental and subject to change.
Stream client
Real-time embedding stream. Public clients receive 128d vectors; token-authenticated clients receive full 768d.
go run . stream # public endpoint (128d)
go run . stream -token x # with bearer token (768d)
go run . stream -token x -out events.jsonl # write events to file
Without a subcommand, the client defaults to stream for backwards compatibility.
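The relationship between the 128d and 768d vectors can be illustrated with a minimal sketch. It assumes, as is usual for Matryoshka embeddings but not confirmed for Divepool, that the 128d vector is the first 128 components of the 768d vector, rescaled to unit L2 norm. The function name `truncateAndNormalize` is hypothetical and not part of this client.

```go
package main

import (
	"fmt"
	"math"
)

// truncateAndNormalize keeps the first dim components of v and rescales
// them to unit L2 norm. Assumption: this mirrors how a Matryoshka-style
// short vector is derived from a full one; the server may do it differently.
func truncateAndNormalize(v []float32, dim int) []float32 {
	if dim > len(v) {
		dim = len(v)
	}
	out := make([]float32, dim)
	var sumSq float64
	for i := 0; i < dim; i++ {
		out[i] = v[i]
		sumSq += float64(v[i]) * float64(v[i])
	}
	norm := math.Sqrt(sumSq)
	if norm == 0 {
		return out // zero vector: nothing to normalize
	}
	for i := range out {
		out[i] = float32(float64(out[i]) / norm)
	}
	return out
}

func main() {
	full := []float32{3, 4, 12, 0} // toy stand-in for a 768d vector
	short := truncateAndNormalize(full, 2)
	fmt.Println(short) // first two components, rescaled to unit length
}
```

If the assumption holds, a token-authenticated client could derive the public 128d representation locally with `truncateAndNormalize(v, 128)`.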
Search client
Semantic search across all indexed Bluesky posts. Requires a bearer token.
go run . search -token x "machine learning" # basic search (top 20)
go run . search -token x -limit 100 "machine learning" # more results (max 1000)
go run . search -token x -distinct=false "machine learning" # multiple posts per account
go run . search -token x -cluster "machine learning" # with topic clusters
Search API
POST /api/v1/search with Authorization: Bearer <token>.
Request:
{
  "query": "machine learning",
  "limit": 100,
  "distinct": true,
  "cluster": true
}
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | (none) | Search query text (required) |
| limit | int | 1000 | Max results (1–1000) |
| distinct | bool | true | One result per account |
| cluster | bool | false | Enable UMAP+HDBSCAN clustering with c-TF-IDF topic extraction |
Response:
{
  "results": [
    {
      "did": "did:plc:abc123",
      "handle": "user.bsky.social",
      "collection": "app.bsky.feed.post",
      "rkey": "3abc",
      "text": "Post text...",
      "score": -0.82,
      "created_at": "2025-01-15T10:30:00Z",
      "detected_lang": "en",
      "cluster_id": 0,
      "topics": ["topic1", "topic2"]
    }
  ],
  "clusters": [
    {
      "id": 0,
      "size": 15,
      "topics": ["topic1", "topic2", "topic3"],
      "result_indices": [0, 3, 7, 12]
    }
  ]
}
The clusters array and the per-result cluster_id and topics fields are present only when cluster is true. score is the negative inner product, so lower (more negative) values are more similar.
Firehose protocol
Zstd-compressed NDJSON over HTTP. Each line is a columnar batch:
{"did":["did:plc:abc","did:plc:xyz"],"col":["app.bsky.feed.post","app.bsky.actor.profile"],"rkey":["3abc","self"],"lang":["en","de"],"c":[[128 floats],[128 floats]],"r":[[128 floats],[128 floats]]}
| Field | Description |
|---|---|
| did | AT Protocol DID (e.g. did:plc:abc123) |
| col | AT Protocol collection NSID (e.g. app.bsky.feed.post, app.bsky.actor.profile) |
| rkey | Record key within the collection |
| lang | Detected language (e.g. en, de) |
| c | Cluster embedding (prefix "task: clustering \| query: ") |
| r | Retrieval embedding (prefix "title: none \| text: ") |
Empty batches ([]) are heartbeats sent every 10s.
Verify embeddings
Verify events from the stream against local EmbeddingGemma output:
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
HF_TOKEN=hf_... .venv/bin/python3 verify.py events.jsonl
verify.py takes JSONL captured with -out (fields did, col, rkey, lang, c, r). It fetches each record via the Bluesky API, prepares the text identically to Divepool (posts: text + image/video alt texts + tags; profiles: displayName + description), embeds it locally, and compares the local embedding with the streamed one via cosine similarity. A Hugging Face token is required because the model is gated.
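The comparison step can be sketched in Go as well. verify.py itself is a Python script, so this is only an illustration of cosine similarity, and the function name `cosine` is hypothetical. For L2-normalized vectors the result equals the plain inner product, but the sketch divides by the norms anyway so that it also works for vectors that are not exactly unit length.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b, in the range [-1, 1].
func cosine(a, b []float32) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("vectors must be non-empty and of equal length")
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine similarity is undefined for a zero vector")
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB)), nil
}

func main() {
	streamed := []float32{0.6, 0.8} // toy stand-in for a streamed vector
	local := []float32{0.8, 0.6}    // toy stand-in for a locally computed vector
	sim, err := cosine(streamed, local)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cosine similarity: %.4f\n", sim)
}
```

A value close to 1 indicates that the streamed and local embeddings agree; the threshold for "close enough" depends on numeric precision and is left to the caller.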
License
MIT