Divepool Embedding Firehose Client (experimental)#
Example client for the Divepool embedding firehose and search API — a real-time stream of 128d EmbeddingGemma embeddings (Matryoshka truncation, L2-normalized) for Bluesky posts and profiles, plus a semantic search endpoint. Both the API and this client are highly experimental and subject to change.
Stream client#
Real-time embedding stream. Public clients receive 128d vectors; token-authenticated clients receive full 768d.
go run . stream # public endpoint (128d)
go run . stream -token x # with bearer token (768d)
go run . stream -token x -out events.jsonl # write events to file
When no subcommand is given, the client defaults to stream for backwards compatibility.
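The 128d public vectors are described above as Matryoshka-truncated and L2-normalized. As a minimal sketch of what that usually means, assuming the truncation keeps the leading components of the full vector and rescales them to unit length (the helper name `truncateAndNormalize` is hypothetical and not part of this client), a 768d vector could be reduced for comparison with public 128d vectors like this:

```go
package main

import (
	"fmt"
	"math"
)

// truncateAndNormalize keeps the first dims components of v and rescales
// them to unit L2 length. It returns nil if dims is out of range or the
// truncated vector has zero norm.
// Assumption: Divepool's 128d vectors are produced this way; the README
// only states "Matryoshka truncation, L2-normalized".
func truncateAndNormalize(v []float32, dims int) []float32 {
	if dims <= 0 || dims > len(v) {
		return nil
	}
	var sumSq float64
	for _, x := range v[:dims] {
		sumSq += float64(x) * float64(x)
	}
	if sumSq == 0 {
		return nil
	}
	norm := math.Sqrt(sumSq)
	out := make([]float32, dims)
	for i, x := range v[:dims] {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

func main() {
	// Toy 4d vector standing in for a 768d embedding.
	full := []float32{3, 4, 12, 0}
	short := truncateAndNormalize(full, 2)
	fmt.Println(short) // the first two components, rescaled to unit length
}
```

For real data the call would be `truncateAndNormalize(vec768, 128)`.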
Search client#
Semantic search across all indexed Bluesky posts. A bearer token is optional: requests without one receive 128d embeddings, requests with one receive 768d.
# Global semantic search
go run . search "machine learning" # public search (top 20)
go run . search -token x "machine learning" # bearer search (768d embeddings)
go run . search -limit 100 "machine learning" # more results (max 400 global)
go run . search -distinct=false "machine learning" # multiple posts per account
go run . search -cluster "machine learning" # with topic clusters
go run . search -embeddings "machine learning" # include per-result embeddings
# Account-scoped (max 1200)
go run . search -did did:plc:abc123 # browse account (newest, clustered)
go run . search -did did:plc:abc123 "machine learning" # search within account
go run . search -did did:plc:abc123 -cluster "machine learning" # search within account + cluster
# Search by example post
go run . search -did did:plc:abc123 -rkey 3abc # find similar posts globally
go run . search -did did:plc:abc123 -rkey 3abc -cluster # find similar + cluster
Search API#
POST /api/v1/search — bearer token optional.
Request:
{
"query": "machine learning",
"did": "did:plc:abc123",
"rkey": "3abc",
"limit": 100,
"distinct": true,
"cluster": true,
"include_embeddings": true
}
| Field | Type | Default | Description |
|---|---|---|---|
| query | string | - | Search query text (exclusive with rkey) |
| did | string | - | Scope search to a specific account (AT Protocol DID) |
| rkey | string | - | Use this post's embedding as query (requires did, exclusive with query) |
| limit | int | 400/1200 | Max results. Global: max 400. DID-scoped: max 1200 |
| distinct | bool | true | One result per account |
| cluster | bool | false | Enable UMAP+HDBSCAN clustering with c-TF-IDF topic extraction |
| include_embeddings | bool | false | Include per-result cluster embeddings (128d non-bearer, 768d bearer) |
Search modes: at least one of query, did, or rkey is required. Because rkey requires did and cannot be combined with query, the valid combinations are:
| Mode | Fields | Behavior |
|---|---|---|
| Global search | query | Semantic search across all posts (max 400) |
| Browse account | did | Newest posts, always clustered (max 1200) |
| Search in account | did + query | Semantic search within account's posts (max 1200) |
| Similar posts | did + rkey | Use post's embedding as query for global search |
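The field rules and search modes above can be checked on the client before sending a request. The following is a minimal sketch, not this client's actual code: the type `SearchRequest` and its methods are hypothetical names, and applying the global cap of 400 to the "similar posts" mode is an assumption, since the table does not state a limit for it.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// SearchRequest mirrors the documented request fields. Distinct is a
// pointer so that "false" can be sent explicitly while nil leaves the
// server default (true) in place.
type SearchRequest struct {
	Query             string `json:"query,omitempty"`
	DID               string `json:"did,omitempty"`
	Rkey              string `json:"rkey,omitempty"`
	Limit             int    `json:"limit,omitempty"`
	Distinct          *bool  `json:"distinct,omitempty"`
	Cluster           bool   `json:"cluster,omitempty"`
	IncludeEmbeddings bool   `json:"include_embeddings,omitempty"`
}

const (
	maxGlobal = 400  // documented cap for global search
	maxScoped = 1200 // documented cap for DID-scoped search
)

// Mode returns the search mode implied by the populated fields.
func (r SearchRequest) Mode() (string, error) {
	switch {
	case r.Query != "" && r.Rkey != "":
		return "", errors.New("query and rkey are mutually exclusive")
	case r.Rkey != "" && r.DID == "":
		return "", errors.New("rkey requires did")
	case r.Rkey != "":
		return "similar posts", nil
	case r.DID != "" && r.Query != "":
		return "search in account", nil
	case r.DID != "":
		return "browse account", nil
	case r.Query != "":
		return "global search", nil
	}
	return "", errors.New("at least one of query, did, or rkey is required")
}

// Validate checks the mode rules and the per-mode limit cap.
// Assumption: "similar posts" searches globally, so it uses the global cap.
func (r SearchRequest) Validate() (string, error) {
	mode, err := r.Mode()
	if err != nil {
		return "", err
	}
	limitCap := maxGlobal
	if mode == "browse account" || mode == "search in account" {
		limitCap = maxScoped
	}
	if r.Limit < 0 || r.Limit > limitCap {
		return "", fmt.Errorf("limit %d out of range for %s (max %d)", r.Limit, mode, limitCap)
	}
	return mode, nil
}

func main() {
	req := SearchRequest{DID: "did:plc:abc123", Query: "machine learning", Limit: 100, Cluster: true}
	mode, err := req.Validate()
	if err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	body, _ := json.Marshal(req)
	fmt.Println(mode, string(body))
}
```

The marshalled body can then be sent with POST to /api/v1/search, adding an `Authorization: Bearer ...` header when a token is available.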
Response:
{
"results": [
{
"did": "did:plc:abc123",
"handle": "user.bsky.social",
"collection": "app.bsky.feed.post",
"rkey": "3abc",
"text": "Post text...",
"score": -0.82,
"created_at": "2025-01-15T10:30:00Z",
"detected_lang": "en",
"cluster_id": 0,
"topics": ["topic1", "topic2"],
"embedding": [0.12, -0.34, ...]
}
],
"clusters": [
{
"id": 0,
"size": 15,
"topics": ["topic1", "topic2", "topic3"],
"result_indices": [0, 3, 7, 12]
}
]
}
The clusters array and the per-result cluster_id and topics fields are present only when cluster is true. The embedding field is present only when include_embeddings is true. score is the negative inner product, so lower values mean more similar.
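A minimal sketch of decoding this response in Go follows. The type and function names (`SearchResponse`, `parseResponse`, `similarity`, `groupByCluster`) are hypothetical; only the JSON field names come from the response shown above. Optional fields are modelled so that their absence is distinguishable, and the score is negated to recover the inner product.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Result is one search hit. ClusterID is a pointer because the field is
// absent unless clustering was requested, and 0 is a valid cluster id.
type Result struct {
	DID          string    `json:"did"`
	Handle       string    `json:"handle"`
	Collection   string    `json:"collection"`
	Rkey         string    `json:"rkey"`
	Text         string    `json:"text"`
	Score        float64   `json:"score"`
	CreatedAt    time.Time `json:"created_at"`
	DetectedLang string    `json:"detected_lang"`
	ClusterID    *int      `json:"cluster_id,omitempty"`
	Topics       []string  `json:"topics,omitempty"`
	Embedding    []float32 `json:"embedding,omitempty"`
}

// Cluster groups results by index into the results array.
type Cluster struct {
	ID            int      `json:"id"`
	Size          int      `json:"size"`
	Topics        []string `json:"topics"`
	ResultIndices []int    `json:"result_indices"`
}

type SearchResponse struct {
	Results  []Result  `json:"results"`
	Clusters []Cluster `json:"clusters,omitempty"`
}

func parseResponse(data []byte) (SearchResponse, error) {
	var resp SearchResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

// similarity undoes the sign flip: score is the negative inner product.
func similarity(r Result) float64 { return -r.Score }

// groupByCluster resolves result_indices, skipping out-of-range entries.
func groupByCluster(resp SearchResponse) map[int][]Result {
	groups := make(map[int][]Result, len(resp.Clusters))
	for _, c := range resp.Clusters {
		for _, i := range c.ResultIndices {
			if i >= 0 && i < len(resp.Results) {
				groups[c.ID] = append(groups[c.ID], resp.Results[i])
			}
		}
	}
	return groups
}

func main() {
	sample := []byte(`{"results":[{"did":"did:plc:abc123","rkey":"3abc",
		"score":-0.82,"created_at":"2025-01-15T10:30:00Z","cluster_id":0}],
		"clusters":[{"id":0,"size":1,"topics":["topic1"],"result_indices":[0]}]}`)
	resp, err := parseResponse(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for id, rs := range groupByCluster(resp) {
		fmt.Println("cluster", id, "first rkey", rs[0].Rkey, "similarity", similarity(rs[0]))
	}
}
```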
Firehose protocol#
Zstd-compressed NDJSON over HTTP. Each line is a columnar batch:
{"did":["did:plc:abc","did:plc:xyz"],"col":["app.bsky.feed.post","app.bsky.actor.profile"],"rkey":["3abc","self"],"lang":["en","de"],"c":[[128 floats],[128 floats]],"r":[[128 floats],[128 floats]]}
| Field | Description |
|---|---|
| did | AT Protocol DID (e.g. did:plc:abc123) |
| col | AT Protocol collection NSID (e.g. app.bsky.feed.post, app.bsky.actor.profile) |
| rkey | Record key within the collection |
| lang | Detected language (en, de) |
| c | Cluster embedding (prefix "task: clustering \| query: ") |
| r | Retrieval embedding (prefix "title: none \| text: ") |
Empty batches ([]) are heartbeats sent every 10s.
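Since each line carries one column per field, a consumer has to transpose a batch into per-record events. The sketch below shows one way to do that, under two assumptions: all six columns are always present and equally long, and the reader passed in is already decompressed. Go's standard library has no Zstandard decoder, so the HTTP body would first need to be wrapped with a third-party one (for example `zstd.NewReader` from github.com/klauspost/compress/zstd). The names `Batch`, `Event`, `decodeLine`, and `readEvents` are hypothetical and are not taken from this client.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Batch is one NDJSON line: parallel columns, one entry per record.
type Batch struct {
	DID  []string    `json:"did"`
	Col  []string    `json:"col"`
	Rkey []string    `json:"rkey"`
	Lang []string    `json:"lang"`
	C    [][]float32 `json:"c"`
	R    [][]float32 `json:"r"`
}

// Event is a single record after transposing a batch.
type Event struct {
	DID, Col, Rkey, Lang string
	C, R                 []float32
}

// decodeLine turns one line into events. Heartbeats ("[]") and blank
// lines yield no events and no error.
func decodeLine(line []byte) ([]Event, error) {
	line = bytes.TrimSpace(line)
	if len(line) == 0 || bytes.Equal(line, []byte("[]")) {
		return nil, nil
	}
	var b Batch
	if err := json.Unmarshal(line, &b); err != nil {
		return nil, fmt.Errorf("decode batch: %w", err)
	}
	n := len(b.DID)
	if len(b.Col) != n || len(b.Rkey) != n || len(b.Lang) != n || len(b.C) != n || len(b.R) != n {
		return nil, fmt.Errorf("ragged batch: columns differ in length")
	}
	events := make([]Event, n)
	for i := range events {
		events[i] = Event{b.DID[i], b.Col[i], b.Rkey[i], b.Lang[i], b.C[i], b.R[i]}
	}
	return events, nil
}

// readEvents scans decompressed NDJSON and calls handle for each event.
func readEvents(r io.Reader, handle func(Event)) error {
	sc := bufio.NewScanner(r)
	// Lines hold many float arrays, so raise the default 64 KiB line limit.
	sc.Buffer(make([]byte, 0, 1<<20), 64<<20)
	for sc.Scan() {
		events, err := decodeLine(sc.Bytes())
		if err != nil {
			return err
		}
		for _, e := range events {
			handle(e)
		}
	}
	return sc.Err()
}

func main() {
	stream := `{"did":["did:plc:abc"],"col":["app.bsky.feed.post"],"rkey":["3abc"],"lang":["en"],"c":[[0.6,0.8]],"r":[[1,0]]}
[]
`
	err := readEvents(strings.NewReader(stream), func(e Event) {
		fmt.Println(e.DID, e.Col, e.Rkey, e.Lang, len(e.C), len(e.R))
	})
	if err != nil {
		fmt.Println("stream error:", err)
	}
}
```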
Verify embeddings#
Verify events from the stream against local EmbeddingGemma output:
python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
HF_TOKEN=hf_... .venv/bin/python3 verify.py events.jsonl
verify.py takes JSONL captured with -out (fields did, col, rkey, lang, c, r). It fetches each record via the Bluesky API, prepares the text the same way Divepool does (posts: text + image/video alt texts + tags; profiles: displayName + description), embeds it locally, and compares the result with the streamed vectors via cosine similarity. A Hugging Face token is required because the model is gated.
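The comparison step can be illustrated in Go as well. This is a sketch of cosine similarity only, not a port of verify.py, and the function name `cosine` is hypothetical. For L2-normalized vectors such as the ones in the stream, the result equals the plain inner product; the norms are still computed here so that unnormalized local output is handled too.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b. It fails when the
// vectors differ in length or either one has zero norm.
func cosine(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("dimension mismatch")
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0, errors.New("zero-norm vector")
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb)), nil
}

func main() {
	streamed := []float32{0.6, 0.8} // toy stand-in for a streamed vector
	local := []float32{3, 4}        // same direction, not normalized
	sim, err := cosine(streamed, local)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cosine similarity: %.4f\n", sim)
}
```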
License#
MIT