coral

real-time semantic percolation from the Bluesky firehose.

live demo: coral.waow.tech | source: https://tangled.org/zzstoatzz.io/coral

what it does

extracts named entities (people, organizations, places, events) from Bluesky posts and tracks how they cluster together. entities that get discussed together form edges. when a real-world event spans multiple topics, clusters merge - discourse percolates into a unified conversation.

how it works

  1. NER bridge consumes the turbostream firehose, runs spaCy NER, extracts entities
  2. labeler integration drops spam before it hits the graph (via Hailey's labeler)
  3. entity graph tracks co-occurrences (entities in same post = edge), computes clusters via union-find
  4. pheromone edges - edge weights decay exponentially, reinforced on repeated co-occurrence (ant colony optimization inspired)
  5. surprise trending - entities ranked by statistical surprise vs baseline (z-score-like), not raw counts
  6. frontend visualizes entity activity, cluster structure, and firehose health
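
steps 3 and 4 can be sketched roughly as follows. this is an illustrative python sketch, not the zig backend's code: `PheromoneEdges`, `observe_post`, the 300-second half-life, and the unit deposit are all hypothetical names and values. it decays each edge lazily when it is read or reinforced, which is one common way to implement exponential decay without touching every edge on every tick.

```python
import math
from itertools import combinations

HALF_LIFE_S = 300.0  # hypothetical default; the real half-life is configurable


class PheromoneEdges:
    """Co-occurrence edges whose weights decay exponentially over time."""

    def __init__(self, half_life_s: float = HALF_LIFE_S):
        # weight(t) = w0 * exp(-decay_rate * t), halving every half_life_s
        self.decay_rate = math.log(2) / half_life_s
        # (entity_a, entity_b) with a < b  ->  (weight, time of last update)
        self.edges: dict[tuple[str, str], tuple[float, float]] = {}

    def weight(self, a: str, b: str, now: float) -> float:
        """Current edge weight, decayed from its last update to `now`."""
        key = (a, b) if a <= b else (b, a)
        if key not in self.edges:
            return 0.0
        w, last = self.edges[key]
        return w * math.exp(-self.decay_rate * (now - last))

    def observe_post(self, entities: list[str], now: float, deposit: float = 1.0) -> None:
        """Reinforce the edge between every pair of distinct entities in one post."""
        for a, b in combinations(sorted(set(entities)), 2):
            self.edges[(a, b)] = (self.weight(a, b, now) + deposit, now)
```

with a 10-second half-life, an edge deposited at t=0 reads as 0.5 at t=10, and a second co-occurrence at t=10 brings it to 1.5.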

theoretical background

the system draws from several sources:

percolation theory - we use the Newman-Ziff algorithm for efficient cluster detection. on a 2D square lattice, site percolation has a sharp phase transition at p_c ≈ 0.593. our graph isn't a lattice, so we calibrate empirically.
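
a minimal sketch of the union-find bookkeeping that Newman-Ziff-style cluster detection relies on, written in python for brevity rather than taken from the zig backend. `ClusterTracker` and its method names are hypothetical, and counting every tracked entity as "active" in the ratio is an assumption. the point is that the largest cluster size is maintained incrementally as edges arrive, so the percolation ratio is cheap to read at any time.

```python
class ClusterTracker:
    """Union-find over entities that tracks the largest cluster as edges are added."""

    def __init__(self):
        self.parent: dict[str, str] = {}
        self.size: dict[str, int] = {}
        self.largest = 0

    def add(self, x: str) -> None:
        """Register an entity as its own singleton cluster."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            self.largest = max(self.largest, 1)

    def find(self, x: str) -> str:
        """Return the cluster root of x, compressing the path on the way."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: str, b: str) -> None:
        """Merge the clusters of a and b (union by size)."""
        self.add(a)
        self.add(b)
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.largest = max(self.largest, self.size[ra])

    def percolation_ratio(self) -> float:
        """Largest cluster size divided by the number of tracked entities."""
        return self.largest / len(self.parent) if self.parent else 0.0
```

note that plain union-find only supports adding edges. how clusters are recomputed once decayed edges drop out is not covered by this sketch.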

heterogeneous activity - Xie et al. 2021 showed that real social networks percolate at ~1/10th the uniform-theory threshold due to heterogeneous user activity. the plan is to weight mentions by user activity rate following this insight; user weighting is currently off (see design decisions).

NER for topic detection - inspired by Hailey's trending topics. rather than running embeddings on raw text (too noisy), we extract structured entities to reduce surface area.

ATProto labeler system - spam filtering via com.atproto.label. we subscribe to Hailey's labeler stream and drop posts from accounts labeled as spam before NER processing.
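
the gate described above could look something like this in python. it is a sketch under assumptions: `SpamGate` is a hypothetical name, the label value `"spam"` is a guess at what the labeler emits, and labels are assumed to arrive as already-decoded dicts carrying the `uri`, `val`, and optional `neg` fields of a com.atproto.label label. subscribing to and decoding the label stream is left out.

```python
SPAM_VALUES = {"spam"}  # hypothetical; the labeler's actual label values may differ


class SpamGate:
    """Tracks accounts labeled as spam and answers whether a post should pass."""

    def __init__(self, spam_values=SPAM_VALUES):
        self.spam_values = set(spam_values)
        self.flagged: set[str] = set()

    def on_label(self, label: dict) -> None:
        """Apply one decoded label event to the set of flagged accounts."""
        if label.get("val") not in self.spam_values:
            return
        subject = label.get("uri", "")
        if not subject.startswith("did:"):
            return  # only account-level labels matter here, not per-post labels
        if label.get("neg", False):
            self.flagged.discard(subject)  # a negation retracts an earlier label
        else:
            self.flagged.add(subject)

    def allows(self, author_did: str) -> bool:
        """True if posts from this account should continue on to NER."""
        return author_did not in self.flagged
```

checking `gate.allows(author_did)` before running spaCy keeps spam out of the graph and also avoids spending NER time on it.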

design decisions

these are documented as arbitrary choices to be revisited:

| decision | choice | why |
|---|---|---|
| edge definition | same-post co-occurrence | simplest, captures "discussed together" |
| edge weights | pheromone decay (configurable half-life) | ant colony inspired, recent co-occurrences matter more |
| activity threshold | 0.01 mentions/sec (~3 per 5 min) | rate normalizes across quiet/busy periods |
| trending metric | surprise vs baseline (UI), trend ratio (backend) | anomaly detection, not popularity contest |
| percolation threshold | largest_cluster / active > 50% | placeholder, needs empirical calibration |
| entity position | hash(text) → (x, y) | deterministic, stable, no semantic meaning yet |
| user weighting | planned (currently off) | power users count more (Xie 2021) |

see docs/02-semantic-percolation-plan.md for full rationale.
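
to make the activity threshold and trending metric rows concrete, here is one way a z-like surprise score could be computed, sketched in python. the exact formula coral uses is not stated here, so this is an assumption: it treats the baseline as a Poisson rate and divides the excess over the expected count by the square root of the expected count. `surprise`, `is_active`, `rank_by_surprise`, and the `floor` parameter are hypothetical names.

```python
import math

ACTIVITY_THRESHOLD = 0.01  # mentions/sec, i.e. about 3 mentions per 5 minutes


def is_active(mentions: int, window_s: float) -> bool:
    """An entity is active if its mention rate over the window meets the threshold."""
    return mentions / window_s >= ACTIVITY_THRESHOLD


def surprise(observed: int, baseline_rate: float, window_s: float, floor: float = 1.0) -> float:
    """Z-like score: excess over the expected count, in units of its Poisson std dev.

    `floor` keeps the denominator sane for entities with little or no baseline.
    """
    expected = baseline_rate * window_s
    return (observed - expected) / math.sqrt(max(expected, floor))


def rank_by_surprise(counts: dict[str, int], baselines: dict[str, float], window_s: float) -> list[str]:
    """Active entities, most surprising first. Unknown entities get a zero baseline."""
    active = [e for e, n in counts.items() if is_active(n, window_s)]
    return sorted(
        active,
        key=lambda e: surprise(counts[e], baselines.get(e, 0.0), window_s),
        reverse=True,
    )
```

under this scoring, an entity that normally gets 3 mentions per 5 minutes and suddenly gets 30 outranks one that normally gets 360 and now gets 400, which is the "anomaly detection, not popularity contest" behavior.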

stack

  • ner (python): turbostream consumer + spaCy NER + labeler gate → POST to backend
  • backend (zig): entity graph + websocket server + SQLite persistence
  • site: static html/css/js on cloudflare pages

run locally

cd backend && zig build run                 # backend (entity graph + websocket)
cd ner && uv run coral-bridge               # NER bridge (turbostream → spaCy → backend)
cd site && npx wrangler pages dev .         # frontend

deploy

cd backend && fly deploy
cd ner && fly deploy
cd site && npx wrangler pages deploy . --project-name coral

future work

ideas being explored (not commitments):

  • semantic positioning - currently entities hash to arbitrary grid positions. could use embeddings to place semantically similar entities near each other, making the 2D layout a meaningful projection of topic space. unclear whether to embed entity names, representative posts, or cluster centroids.

  • temporal co-activity edges - entities that spike together might be related even without same-post co-occurrence. "earthquake" and "LA" could both trend during an event without always appearing together.

  • percolation calibration - the 50% threshold is arbitrary. need to correlate cluster merges with real-world events to understand what "discourse unification" actually looks like in the data.
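
for reference, the current hash-based placement that semantic positioning would replace might look like the python sketch below. the hash function and grid size are assumptions (`entity_position`, sha256, and the 64x64 grid are illustrative choices, not what the frontend necessarily does). it uses `hashlib` rather than python's built-in `hash`, because the built-in is randomized per process and would not give stable positions.

```python
import hashlib


def entity_position(text: str, width: int = 64, height: int = 64) -> tuple[int, int]:
    """Map an entity's text to a deterministic grid cell with no semantic meaning."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") % width    # first 4 bytes pick the column
    y = int.from_bytes(digest[4:8], "big") % height  # next 4 bytes pick the row
    return x, y
```

the same text always lands in the same cell, but nearby cells say nothing about relatedness, which is the gap an embedding-based layout would close.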

references