Example client for the highly experimental divepool embedding firehose


Divepool Embedding Firehose Client (experimental)

Example client for the Divepool embedding firehose and search API. The firehose is a real-time stream of 128d EmbeddingGemma embeddings (Matryoshka-truncated, L2-normalized) for Bluesky posts and profiles; the search API adds a semantic search endpoint. Both the API and this client are highly experimental and subject to change.

Stream client

Real-time embedding stream. Public clients receive 128d vectors; token-authenticated clients receive the full 768d vectors.

go run . stream                                 # public endpoint (128d)
go run . stream -token x                        # with bearer token (768d)
go run . stream -token x -out events.jsonl      # write events to file

Without a subcommand, the client defaults to stream for backwards compatibility.
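
The relationship between the 128d and 768d vectors can be sketched as follows. This is a minimal illustration of Matryoshka truncation in general (keep the leading dimensions, then L2-normalize again). It assumes, without confirmation from the server side, that the public 128d vectors are derived from the full 768d ones in this way. The helper name truncate_and_normalize is hypothetical and not part of this client.

```python
import math


def truncate_and_normalize(vec, dims=128):
    """Keep the first `dims` components and rescale them to unit L2 norm.

    Assumption: this mirrors how a 128d vector is derived from a 768d one.
    """
    head = [float(x) for x in vec[:dims]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero input: nothing to normalize
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [1.0] * 768  # stand-in for a real 768d embedding
    short = truncate_and_normalize(full)
    # Print the new length and the squared norm of the truncated vector.
    print(len(short), round(sum(x * x for x in short), 6))
```

If the assumption holds, a token-authenticated client could compare its 768d vectors against 128d vectors captured from the public endpoint after applying this step.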

Search client

Semantic search across all indexed Bluesky posts. Requires a bearer token.

go run . search -token x "machine learning"                  # basic search (top 20)
go run . search -token x -limit 100 "machine learning"       # more results (max 1000)
go run . search -token x -distinct=false "machine learning"   # multiple posts per account
go run . search -token x -cluster "machine learning"          # with topic clusters

Search API

Send POST /api/v1/search with the header Authorization: Bearer <token>.

Request:

{
  "query": "machine learning",
  "limit": 100,
  "distinct": true,
  "cluster": true
}

Field     Type    Default  Description
--------  ------  -------  -----------
query     string  -        Search query text (required)
limit     int     20       Max results (1–1000)
distinct  bool    true     One result per account
cluster   bool    false    Enable UMAP+HDBSCAN clustering with c-TF-IDF topic extraction
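
A request with these fields could be assembled as in the sketch below, which uses only Python's standard library. Several things here are assumptions: base_url is a placeholder because this README does not state the API host, the Content-Type header is a guess at what the server expects, and build_search_request is a hypothetical helper rather than part of this client. Optional fields left unset are omitted so the server applies its own defaults.

```python
import json
import urllib.request


def build_search_request(base_url, token, query, limit=None, distinct=None, cluster=None):
    """Build (but do not send) a POST /api/v1/search request."""
    if not query:
        raise ValueError("query is required")
    if limit is not None and not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    body = {"query": query}
    # Only include optional fields the caller actually set.
    for key, value in (("limit", limit), ("distinct", distinct), ("cluster", cluster)):
        if value is not None:
            body[key] = value
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",  # assumption, not stated in the README
        },
        method="POST",
    )


if __name__ == "__main__":
    # "https://example.invalid" is a placeholder host, not the real endpoint.
    req = build_search_request("https://example.invalid", "x", "machine learning",
                               limit=100, cluster=True)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The returned object can be passed to urllib.request.urlopen and the response body decoded with json.load once the real host is known.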

Response:

{
  "results": [
    {
      "did": "did:plc:abc123",
      "handle": "user.bsky.social",
      "collection": "app.bsky.feed.post",
      "rkey": "3abc",
      "text": "Post text...",
      "score": -0.82,
      "created_at": "2025-01-15T10:30:00Z",
      "detected_lang": "en",
      "cluster_id": 0,
      "topics": ["topic1", "topic2"]
    }
  ],
  "clusters": [
    {
      "id": 0,
      "size": 15,
      "topics": ["topic1", "topic2", "topic3"],
      "result_indices": [0, 3, 7, 12]
    }
  ]
}

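The clusters array and the per-result cluster_id and topics fields are present only when the request sets cluster: true. score is the negative inner product of the query and result embeddings, so lower values mean more similar results.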
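
A response of this shape could be post-processed as sketched below. The helper names are hypothetical. The sketch assumes that result_indices are zero-based positions in the results array, which the field name suggests but this README does not state, and it skips indices that fall outside the array. The AT URI layout (at://did/collection/rkey) is the standard AT Protocol form.

```python
def at_uri(result):
    """Build the AT URI (at://<did>/<collection>/<rkey>) for one search result."""
    return "at://{}/{}/{}".format(result["did"], result["collection"], result["rkey"])


def best_first(results):
    """Order results from most to least similar (lowest score first)."""
    return sorted(results, key=lambda r: r["score"])


def results_by_cluster(response):
    """Map each cluster id to its member results via result_indices.

    Returns an empty dict when the response has no "clusters" key,
    i.e. when clustering was not requested.
    """
    results = response.get("results", [])
    grouped = {}
    for cluster in response.get("clusters", []):
        # Assumption: indices are zero-based positions in `results`.
        grouped[cluster["id"]] = [
            results[i] for i in cluster["result_indices"] if 0 <= i < len(results)
        ]
    return grouped


if __name__ == "__main__":
    response = {
        "results": [
            {"did": "did:plc:abc123", "collection": "app.bsky.feed.post",
             "rkey": "3abc", "score": -0.82, "cluster_id": 0},
        ],
        "clusters": [{"id": 0, "size": 1, "topics": ["topic1"], "result_indices": [0]}],
    }
    for result in best_first(response["results"]):
        print(result["score"], at_uri(result))
    print(sorted(results_by_cluster(response)))
```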

Firehose protocol

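The stream is Zstd-compressed NDJSON over HTTP. Each line is a columnar batch: every field holds an array, and the values at the same index across all arrays describe one record.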

{"did":["did:plc:abc","did:plc:xyz"],"col":["app.bsky.feed.post","app.bsky.actor.profile"],"rkey":["3abc","self"],"lang":["en","de"],"c":[[128 floats],[128 floats]],"r":[[128 floats],[128 floats]]}

Field  Description
-----  -----------
did    AT Protocol DID (e.g. did:plc:abc123)
col    AT Protocol collection NSID (e.g. app.bsky.feed.post, app.bsky.actor.profile)
rkey   Record key within the collection
lang   Detected language (en, de)
c      Cluster embedding (prefix "task: clustering | query: ")
r      Retrieval embedding (prefix "title: none | text: ")

Empty batches ([]) are heartbeats sent every 10s.
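
Turning one batch line into per-record events can be sketched as below. The sketch assumes the line has already been Zstd-decompressed (for example with the third-party zstandard package) and that every non-empty batch carries all six columns. decode_batch is a hypothetical helper, not the code used by this client. Any empty JSON value is treated as a heartbeat.

```python
import json

# Column names as listed in the field table.
COLUMNS = ("did", "col", "rkey", "lang", "c", "r")


def decode_batch(line):
    """Convert one decompressed NDJSON line into a list of per-record dicts.

    Heartbeats (empty batches) yield an empty list.
    """
    batch = json.loads(line)
    if not batch:  # heartbeat
        return []
    count = len(batch["did"])
    for name in COLUMNS:
        if len(batch[name]) != count:
            raise ValueError("column {!r} has a different length than 'did'".format(name))
    # Row i is built from element i of every column.
    return [{name: batch[name][i] for name in COLUMNS} for i in range(count)]


if __name__ == "__main__":
    # Embeddings shortened to two floats each to keep the example readable.
    sample = json.dumps({
        "did": ["did:plc:abc", "did:plc:xyz"],
        "col": ["app.bsky.feed.post", "app.bsky.actor.profile"],
        "rkey": ["3abc", "self"],
        "lang": ["en", "de"],
        "c": [[0.1, 0.2], [0.3, 0.4]],
        "r": [[0.5, 0.6], [0.7, 0.8]],
    })
    for event in decode_batch(sample):
        print(event["did"], event["col"], event["rkey"], event["lang"])
```

Checking column lengths up front keeps a malformed batch from silently producing misaligned records.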

Verify embeddings

Verify events from the stream against local EmbeddingGemma output:

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
HF_TOKEN=hf_... .venv/bin/python3 verify.py events.jsonl

verify.py takes the JSONL captured with -out (fields did, col, rkey, lang, c, r). It fetches each record via the Bluesky API, prepares the text identically to Divepool (posts: text + image/video alt texts + tags; profiles: displayName + description), embeds it locally, and compares the result with the streamed embedding via cosine similarity. A Hugging Face token is required because the model is gated.
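
The comparison step can be illustrated with the sketch below. It is not the code in verify.py: the helper names are hypothetical, the 0.99 threshold is an arbitrary illustrative value, and the model call itself is left out. The two prompt prefixes are the ones listed in the firehose field table.

```python
import math

# Prompt prefixes from the firehose field table.
CLUSTER_PREFIX = "task: clustering | query: "
RETRIEVAL_PREFIX = "title: none | text: "


def model_inputs(text):
    """Return the two prefixed strings to embed for the c and r vectors."""
    return {"c": CLUSTER_PREFIX + text, "r": RETRIEVAL_PREFIX + text}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches(streamed, local, threshold=0.99):
    """Illustrative check; the threshold is not taken from verify.py."""
    return cosine_similarity(streamed, local) >= threshold


if __name__ == "__main__":
    print(model_inputs("hello world")["r"])
    print(matches([0.6, 0.8], [0.6, 0.8]))
```

Because the streamed vectors are L2-normalized, cosine similarity against a normalized local embedding reduces to a plain inner product.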

License

MIT