A tool for tailing a labelers' firehose, rehydrating, and storing records for future analysis of moderation decisions.
4
fork

Configure Feed

Select the types of activity you want to include in your feed.

docs: update README for phases 4-5 completion

- Updated architecture diagram to include blob processing flow
- Added blob processor and rate limiter to components list
- Removed "Phase 4" markers from configuration options
- Updated project structure to show blobs/ and utils/ directories
- Marked phases 4-5 as complete in roadmap
- Added Rate Limiting section explaining p-ratelimit behavior
- Added blob processing log event to monitoring section

๐Ÿค– Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

+29 -10
+29 -10
README.md
··· 6 6 7 7 - **Real-time Label Capture**: Subscribe to any Bluesky labeler's firehose via WebSocket 8 8 - **Automatic Content Hydration**: Fetch full post records and user profiles for labeled content 9 + - **Blob Processing**: SHA-256 and perceptual hashing for images/videos with optional download 9 10 - **Intelligent Filtering**: Optionally filter labels by type to capture only what you need 11 + - **Rate Limiting**: Respects Bluesky API limits (3000 req/5min) with p-ratelimit 12 + - **Retry Logic**: Automatic retry with exponential backoff for transient failures 10 13 - **Cursor Persistence**: Resume from where you left off after restarts 11 - - **Automatic Reconnection**: Exponential backoff reconnection (1s-30s) for stability 14 + - **Automatic Reconnection**: Exponential backoff reconnection (1s-60s) for stability 12 15 - **DuckDB Storage**: Embedded analytics database optimized for ML pipelines 13 16 - **Docker Ready**: Containerized deployment with volume persistence 14 17 - **Type-Safe**: Full TypeScript implementation with Zod validation ··· 18 21 ``` 19 22 Firehose โ†’ Label Event โ†’ Filter โ†’ Store Label โ†’ Hydration Queue 20 23 โ†“ โ†“ 21 - DuckDB โ† [Post/Profile Fetch] 24 + DuckDB โ† [Post/Profile Fetch] โ†’ Blob Processing 25 + โ†“ 26 + Hash + Store 22 27 ``` 23 28 24 29 ### Components 25 30 26 31 - **Firehose Subscriber**: WebSocket client with DAG-CBOR decoding 27 32 - **Label Filter**: Configurable allow-list for label types 28 - - **Hydration Services**: Automatic post and profile data fetching 33 + - **Hydration Services**: Automatic post and profile data fetching with rate limiting 34 + - **Blob Processor**: SHA-256 and perceptual hash computation with optional download 29 35 - **Hydration Queue**: Async queue with deduplication 36 + - **Rate Limiter**: p-ratelimit enforcing 3000 requests per 5 minutes 37 + - **Retry Logic**: Exponential backoff for transient failures 30 38 - **Repository Layer**: Clean database abstraction for all entities 31 39 32 40 ## Quick Start ··· 104 112 - `CAPTURE_LABELS`: Comma-separated list of label values to capture 105 113 - `DB_PATH`: Path to DuckDB database file (default: `./data/skywatch.duckdb`) 106 114 - `LOG_LEVEL`: Logging level (default: `info`) 107 - - `HYDRATE_BLOBS`: Enable blob download (default: `false`) - Phase 4 108 - - `BLOB_STORAGE_TYPE`: Storage backend for blobs (`local` or `s3`) - Phase 4 109 - - `BLOB_STORAGE_PATH`: Local path for blob storage - Phase 4 115 + - `HYDRATE_BLOBS`: Enable blob download (default: `false`) 116 + - `BLOB_STORAGE_TYPE`: Storage backend for blobs (`local` or `s3`) 117 + - `BLOB_STORAGE_PATH`: Local path for blob storage (default: `./data/blobs`) 110 118 111 - ### S3 Configuration (Phase 4, Optional) 119 + ### S3 Configuration (Optional) 112 120 113 121 - `S3_BUCKET`: S3 bucket name 114 122 - `S3_REGION`: AWS region ··· 150 158 - `display_name`: Display name 151 159 - `description`: Bio/description 152 160 153 - ### Blobs Table (Phase 4) 161 + ### Blobs Table 154 162 Image and video blob metadata. 155 163 156 164 - `post_uri`: Associated post URI ··· 188 196 - `Received label`: Label captured and stored 189 197 - `Post hydrated successfully`: Post data fetched 190 198 - `Profile hydrated successfully`: Profile data fetched 199 + - `Blob processed`: Blob hashed and optionally stored 200 + 201 + ## Rate Limiting 202 + 203 + The service implements p-ratelimit to respect Bluesky's API limits: 204 + - **Limit**: 3000 requests per 5 minutes per IP address 205 + - **Concurrency**: Up to 48 concurrent requests 206 + - **Backoff**: Automatic delays when approaching limits 207 + - **Retry Logic**: Exponential backoff for rate limit errors (1s-10s) 191 208 192 209 ## Development 193 210 ··· 196 213 ``` 197 214 skywatch-tail/ 198 215 โ”œโ”€โ”€ src/ 216 + โ”‚ โ”œโ”€โ”€ blobs/ # Blob processing and storage 199 217 โ”‚ โ”œโ”€โ”€ config/ # Environment validation 200 218 โ”‚ โ”œโ”€โ”€ database/ # Schema and repositories 201 219 โ”‚ โ”œโ”€โ”€ firehose/ # WebSocket subscriber 202 220 โ”‚ โ”œโ”€โ”€ hydration/ # Content hydration services 203 221 โ”‚ โ”œโ”€โ”€ logger/ # Pino logger setup 222 + โ”‚ โ”œโ”€โ”€ utils/ # Retry logic and helpers 204 223 โ”‚ โ””โ”€โ”€ index.ts # Main entry point 205 224 โ”œโ”€โ”€ tests/ 206 225 โ”‚ โ”œโ”€โ”€ integration/ # Database integration tests ··· 244 263 - [x] Phase 1: Core infrastructure (Docker, config, database, logging) 245 264 - [x] Phase 2: Firehose connection and label capture 246 265 - [x] Phase 3: Content hydration (posts and profiles) 247 - - [ ] Phase 4: Blob processing (image/video hashing and storage) 248 - - [ ] Phase 5: Rate limiting and optimization 266 + - [x] Phase 4: Blob processing (image/video hashing and storage) 267 + - [x] Phase 5: Rate limiting and optimization 249 268 - [ ] Phase 6: Comprehensive testing 250 269 - [ ] Phase 7: Documentation 251 270