
Divepool Embedding Firehose Client (experimental)

Example client for the Divepool embedding firehose and search API — a real-time stream of 128d EmbeddingGemma embeddings (Matryoshka truncation, L2-normalized) for Bluesky posts and profiles, plus a semantic search endpoint. Both the API and this client are highly experimental and subject to change.
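
As a rough illustration of what "Matryoshka truncation, L2-normalized" means: a full-size vector is cut down to its leading dimensions and rescaled to unit length. The sketch below assumes the 128d vectors are simply the first 128 components of the 768d output, renormalized; the README does not spell this out, and the function name is illustrative rather than part of this client.

```python
import math


def truncate_and_normalize(vec, dims=128):
    """Keep the first `dims` components and rescale them to unit L2 norm."""
    head = [float(x) for x in vec[:dims]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero input: nothing to rescale
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [0.5] * 768  # stand-in for a 768d model output
    small = truncate_and_normalize(full)
    # Prints the truncated length and the squared norm of the result.
    print(len(small), round(sum(x * x for x in small), 6))
```

Once vectors have unit length, inner product and cosine similarity coincide, which is why the search score can be read as a similarity.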

Stream client

Real-time embedding stream. Public clients receive 128d vectors; token-authenticated clients receive full 768d.

go run . stream                                 # public endpoint (128d)
go run . stream -token x                        # with bearer token (768d)
go run . stream -token x -out events.jsonl      # write events to file

If no subcommand is given, the client defaults to stream for backwards compatibility.

Search client

Semantic search across all indexed Bluesky posts. A bearer token is optional: requests without one receive 128d embeddings, requests with one receive 768d.

# Global semantic search
go run . search "machine learning"                                     # public search (top 20)
go run . search -token x "machine learning"                            # bearer search (768d embeddings)
go run . search -limit 100 "machine learning"                          # more results (max 400 global)
go run . search -distinct=false "machine learning"                     # multiple posts per account
go run . search -cluster "machine learning"                            # with topic clusters
go run . search -embeddings "machine learning"                         # include per-result embeddings

# Account-scoped (max 1200)
go run . search -did did:plc:abc123                                    # browse account (newest, clustered)
go run . search -did did:plc:abc123 "machine learning"                 # search within account
go run . search -did did:plc:abc123 -cluster "machine learning"        # search within account + cluster

# Search by example post
go run . search -did did:plc:abc123 -rkey 3abc                         # find similar posts globally
go run . search -did did:plc:abc123 -rkey 3abc -cluster                # find similar + cluster

Search API

POST /api/v1/search — bearer token optional.

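Request (every field is shown for reference; query and rkey are mutually exclusive, so a real request sets only one of them):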

{
  "query": "machine learning",
  "did": "did:plc:abc123",
  "rkey": "3abc",
  "limit": 100,
  "distinct": true,
  "cluster": true,
  "include_embeddings": true
}

| Field | Type | Default | Description |
|---|---|---|---|
| query | string | | Search query text (exclusive with rkey) |
| did | string | | Scope search to a specific account (AT Protocol DID) |
| rkey | string | | Use this post's embedding as query (requires did, exclusive with query) |
| limit | int | 400/1200 | Max results. Global: max 400. DID-scoped: max 1200 |
| distinct | bool | true | One result per account |
| cluster | bool | false | Enable UMAP+HDBSCAN clustering with c-TF-IDF topic extraction |
| include_embeddings | bool | false | Include per-result cluster embeddings (128d non-bearer, 768d bearer) |
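
A request with these fields can be assembled with the Python standard library alone. This is a minimal sketch, not part of the client: BASE_URL is a placeholder (this README does not state the API host), build_search_request is a hypothetical helper, and the Authorization: Bearer header form is an assumption based on the "bearer token" wording.

```python
import json
import urllib.request

# Placeholder host: the real API host is not given in this README.
BASE_URL = "https://example.invalid"


def build_search_request(payload, token=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/v1/search."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumed header form for the optional bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url + "/api/v1/search",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    req = build_search_request({"query": "machine learning", "limit": 100})
    print(req.get_method(), req.full_url)
    # To send it, pass req to urllib.request.urlopen() and parse the
    # response body with json.load().
```

Keeping construction separate from sending makes it easy to inspect the payload and headers before any network call.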

Search modes: At least one of query, did, or rkey is required.

| Mode | Fields | Behavior |
|---|---|---|
| Global search | query | Semantic search across all posts (max 400) |
| Browse account | did | Newest posts, always clustered (max 1200) |
| Search in account | did + query | Semantic search within account's posts (max 1200) |
| Similar posts | did + rkey | Use post's embedding as query for global search |
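
The mode rules can be checked on the client side before a request is sent. The sketch below is illustrative only: search_mode is a hypothetical helper, and applying the global cap of 400 to similar-post searches is an assumption, since the README states no limit for that mode.

```python
GLOBAL_MAX = 400    # documented cap for global searches
ACCOUNT_MAX = 1200  # documented cap for DID-scoped searches


def search_mode(payload):
    """Return the documented mode name for a request, or raise ValueError."""
    query = payload.get("query")
    did = payload.get("did")
    rkey = payload.get("rkey")

    if query and rkey:
        raise ValueError("query and rkey are mutually exclusive")
    if rkey and not did:
        raise ValueError("rkey requires did")
    if not (query or did):
        raise ValueError("at least one of query, did, or rkey is required")

    if rkey:
        # Assumption: similar-post searches are global, so use the global cap.
        mode, cap = "similar posts", GLOBAL_MAX
    elif did and query:
        mode, cap = "search in account", ACCOUNT_MAX
    elif did:
        mode, cap = "browse account", ACCOUNT_MAX
    else:
        mode, cap = "global search", GLOBAL_MAX

    limit = payload.get("limit")
    if limit is not None and limit > cap:
        raise ValueError(f"limit {limit} exceeds the maximum of {cap} for {mode}")
    return mode


if __name__ == "__main__":
    print(search_mode({"did": "did:plc:abc123", "query": "machine learning"}))
```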

Response:

{
  "results": [
    {
      "did": "did:plc:abc123",
      "handle": "user.bsky.social",
      "collection": "app.bsky.feed.post",
      "rkey": "3abc",
      "text": "Post text...",
      "score": -0.82,
      "created_at": "2025-01-15T10:30:00Z",
      "detected_lang": "en",
      "cluster_id": 0,
      "topics": ["topic1", "topic2"],
      "embedding": [0.12, -0.34, ...]
    }
  ],
  "clusters": [
    {
      "id": 0,
      "size": 15,
      "topics": ["topic1", "topic2", "topic3"],
      "result_indices": [0, 3, 7, 12]
    }
  ]
}

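clusters and the per-result cluster_id and topics fields are present only when cluster is true; embedding is present only when include_embeddings is true. score is the negative inner product, so lower means more similar. Because the embeddings are L2-normalized, the inner product equals cosine similarity: a score of -0.82 corresponds to a cosine similarity of 0.82.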
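
A response of this shape can be post-processed as sketched below. The helper name is hypothetical, and the code assumes that result_indices holds positions in the results array, which the field name suggests but the README does not state.

```python
def summarize_response(response):
    """Turn scores into similarities and list result texts per cluster."""
    results = response.get("results", [])
    # Score is the negative inner product, so negating it gives a value
    # where higher means more similar.
    similarities = [-r["score"] for r in results]

    by_cluster = {}
    # "clusters" is absent unless clustering was requested.
    for cluster in response.get("clusters", []):
        # Assumption: result_indices are positions in the results array.
        texts = [results[i]["text"] for i in cluster["result_indices"]
                 if i < len(results)]
        by_cluster[cluster["id"]] = {"topics": cluster["topics"], "texts": texts}
    return similarities, by_cluster


if __name__ == "__main__":
    demo = {
        "results": [
            {"text": "Post A", "score": -0.82, "cluster_id": 0},
            {"text": "Post B", "score": -0.40, "cluster_id": 0},
        ],
        "clusters": [
            {"id": 0, "size": 2, "topics": ["topic1"], "result_indices": [0, 1]}
        ],
    }
    print(summarize_response(demo))
```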

Firehose protocol

Zstd-compressed NDJSON over HTTP. Each line is a columnar batch:

{"did":["did:plc:abc","did:plc:xyz"],"col":["app.bsky.feed.post","app.bsky.actor.profile"],"rkey":["3abc","self"],"lang":["en","de"],"c":[[128 floats],[128 floats]],"r":[[128 floats],[128 floats]]}

| Field | Description |
|---|---|
| did | AT Protocol DID (e.g. did:plc:abc123) |
| col | AT Protocol collection NSID (e.g. app.bsky.feed.post, app.bsky.actor.profile) |
| rkey | Record key within the collection |
| lang | Detected language (e.g. en, de) |
| c | Cluster embedding (prefix "task: clustering \| query: ") |
| r | Retrieval embedding (prefix "title: none \| text: ") |

Empty batches ([]) are heartbeats sent every 10s.
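
Because each batch is columnar, entry i of every array describes the same event. The sketch below turns one already-decompressed NDJSON line into per-event records and treats an empty batch as a heartbeat. Zstd decompression is left out because it needs a third-party library, and unpack_batch is a hypothetical helper, not a function from this client.

```python
import json

COLUMNS = ("did", "col", "rkey", "lang", "c", "r")


def unpack_batch(line):
    """Convert one decompressed NDJSON line into a list of per-event dicts."""
    batch = json.loads(line)
    if not batch:  # heartbeat: the documented empty batch
        return []
    n = len(batch["did"])
    for name in COLUMNS:
        if len(batch[name]) != n:
            raise ValueError(
                f"column {name!r} has {len(batch[name])} entries, expected {n}"
            )
    # Entry i of every column belongs to the same event.
    return [{name: batch[name][i] for name in COLUMNS} for i in range(n)]


if __name__ == "__main__":
    line = json.dumps({
        "did": ["did:plc:abc", "did:plc:xyz"],
        "col": ["app.bsky.feed.post", "app.bsky.actor.profile"],
        "rkey": ["3abc", "self"],
        "lang": ["en", "de"],
        "c": [[0.1, 0.2], [0.3, 0.4]],  # shortened vectors for the demo
        "r": [[0.5, 0.6], [0.7, 0.8]],
    })
    for event in unpack_batch(line):
        print(event["did"], event["col"], event["rkey"], event["lang"])
```

Checking that all columns have the same length catches truncated or malformed batches before they produce misaligned events.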

Verify embeddings

Verify events from the stream against local EmbeddingGemma output:

python3 -m venv .venv && .venv/bin/pip install -r requirements.txt
HF_TOKEN=hf_... .venv/bin/python3 verify.py events.jsonl

verify.py takes the JSONL captured with -out (fields did, col, rkey, lang, c, r). It fetches each record via the Bluesky API, prepares the text the same way Divepool does (posts: text + image/video alt texts + tags; profiles: displayName + description), embeds it locally, and compares the result with the streamed vectors by cosine similarity. A Hugging Face token is required because the model is gated.
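
The comparison step comes down to a cosine similarity between a streamed vector and a locally computed one. The following is a dependency-free sketch of that idea, not the code in verify.py; the matches helper and its 0.99 threshold are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches(streamed, local, threshold=0.99):
    """True if the streamed and locally computed vectors agree closely.

    The 0.99 threshold is an illustrative assumption, not verify.py's value.
    """
    return cosine_similarity(streamed, local) >= threshold


if __name__ == "__main__":
    streamed = [0.6, 0.8]     # stand-in for a vector from the stream
    local = [0.61, 0.79]      # stand-in for a locally computed vector
    print(matches(streamed, local))
```

Small differences between the two vectors are expected from numeric precision and hardware, which is why a threshold is used rather than exact equality.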

License

MIT