A tool for tailing a labeler's firehose, rehydrating, and storing records for future analysis of moderation decisions.

docs: add profile blob hydration implementation notes

Comprehensive documentation of key learnings including:
- CID deserialization quirks in @atproto/api
- PDS endpoint resolution via PLC directory
- Proper blob fetching with AT Protocol API
- Database schema design for profile vs post blobs
- Change tracking and re-hydration logic
- Common errors and solutions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Changed files (+164)
docs/profile-blob-hydration.md (+164)
# Profile Blob Hydration - Implementation Notes

## Overview

This document captures key learnings from implementing avatar and banner blob hydration for Bluesky profiles.

## Key Discoveries

### 1. CID Deserialization in @atproto/api

The `@atproto/api` library deserializes blob references from their JSON `$link` representation into CID class objects, so `ref.$link` is not available on the deserialized record.

**Raw JSON from API:**
```json
{
  "avatar": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreigg3s6plegjncmxubeufbohj3qasbm4r23q2x7zlivdhccfqfypve"
    },
    "mimeType": "image/jpeg",
    "size": 101770
  }
}
```

**What you get in TypeScript:**
```typescript
record.avatar.ref // CID object with { code, version, hash, ... }
```

**Solution:**
```typescript
const cid = record.avatar.ref.toString(); // "bafkrei..."
```

### 2. PDS Endpoint Resolution

Users can be on different Personal Data Servers (PDS), not just `bsky.social`. Blobs must be fetched from the user's actual PDS.

**Process:**
1. Query PLC directory for DID document: `https://plc.wtf/${did}`
2. Find service with `id: "#atproto_pds"` and `type: "AtprotoPersonalDataServer"`
3. Extract `serviceEndpoint` URL
4. Use that endpoint for `com.atproto.sync.getBlob`

**Example:**
```typescript
// Fetch the DID document and locate the PDS service entry.
const didDoc = await fetch(`https://plc.wtf/${did}`).then(r => r.json());
const pdsService = didDoc.service?.find(s =>
  s.id === "#atproto_pds" && s.type === "AtprotoPersonalDataServer"
);
if (!pdsService) {
  throw new Error(`No PDS service found in DID document for ${did}`);
}
const pdsEndpoint = pdsService.serviceEndpoint; // e.g., "https://waxcap.us-west.host.bsky.network"
```

### 3. Correct Blob Fetching

**Don't use CDN paths** - they don't work reliably for all blobs and require authentication context.

**Use the AT Protocol API:**
```typescript
// Encode the query parameters and fail loudly on a non-2xx response.
const blobUrl = `${pdsEndpoint}/xrpc/com.atproto.sync.getBlob?did=${encodeURIComponent(did)}&cid=${encodeURIComponent(cid)}`;
const response = await fetch(blobUrl);
if (!response.ok) {
  throw new Error(`getBlob failed for ${did}/${cid}: HTTP ${response.status}`);
}
const blobData = Buffer.from(await response.arrayBuffer());
```

### 4. Database Schema Design

**Separate tables for different blob types:**

- `blobs` table: Post images with FK to `posts(uri)`
- `profile_blobs` table: Avatars/banners with FK to `profiles(did)`

Keeping each blob type in a table with its own foreign key allows proper relational queries and analysis.

**Profile blobs schema:**
```sql
CREATE TABLE profile_blobs (
  did TEXT NOT NULL,
  blob_type TEXT NOT NULL CHECK (blob_type IN ('avatar', 'banner')),
  blob_cid TEXT NOT NULL,
  sha256 TEXT NOT NULL,
  phash TEXT,
  storage_path TEXT,
  mimetype TEXT,
  captured_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (did, blob_type, captured_at),
  FOREIGN KEY (did) REFERENCES profiles(did)
);
```

### 5. Change Tracking

Including `captured_at` in the primary key lets each change be stored as a new row, so the table keeps a history of when users change their avatars/banners.

**Query latest state:**
```sql
SELECT * FROM profile_blobs
WHERE did = ? AND blob_type = ?
ORDER BY captured_at DESC
LIMIT 1;
```

**Only insert if changed:**
```typescript
const latest = await findLatestByDidAndType(did, type);
if (latest && latest.blob_cid === cid) {
  return; // No change, skip
}
// Insert new row with current timestamp
```

### 6. Sentinel Values for Missing Data

Use an empty string (`""`) to mean "we checked, and the user has no avatar", as distinct from `NULL`, which means "we haven't checked yet".

```typescript
if (record.avatar?.ref) {
  avatarCid = record.avatar.ref.toString();
} else {
  avatarCid = ""; // Explicitly checked, not present
}
```

This prevents infinite re-hydration loops for profiles without avatars.

### 7. Profile Re-hydration Logic

A profile that is already stored is re-hydrated only if one of its CID columns is still `NULL` (never checked); otherwise it is skipped.

```typescript
const existingProfile = await findByDid(did);
const needsRehydration = existingProfile &&
  (existingProfile.avatar_cid === null || existingProfile.banner_cid === null);

if (existingProfile && !needsRehydration) {
  return; // Skip
}
```

## Configuration

- `PLC_ENDPOINT`: DID resolution endpoint (default: `https://plc.wtf`)
  - Can be changed to `https://plc.directory` or a custom instance
  - plc.wtf is faster but unofficial

## Common Errors

### "RepoNotFound"
- **Cause:** Querying the wrong PDS endpoint
- **Solution:** Resolve the correct PDS from the DID document

### Foreign Key Constraint Violation
- **Cause:** Trying to insert profile blobs into the `blobs` table
- **Solution:** Use the separate `profile_blobs` table

### Missing CIDs Despite API Returning Them
- **Cause:** Trying to access `ref.$link` when `ref` is a CID object
- **Solution:** Call `.toString()` on the CID object

## Related Files

- `src/hydration/profiles.service.ts` - Main hydration logic
- `src/database/profile-blobs.repository.ts` - Profile blob persistence
- `src/database/schema.ts` - Table definitions
- `src/config/index.ts` - PLC endpoint configuration
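
## Example: Pure Helpers (Sketch)

The decision logic described in these notes (picking the PDS endpoint out of a DID document, building the `com.atproto.sync.getBlob` URL, and interpreting the `NULL` / `""` sentinel) can be isolated from network and database access. The following is a minimal sketch, not the code in `src/hydration/profiles.service.ts`: the names `DidDocument`, `extractPdsEndpoint`, `buildGetBlobUrl`, `needsRehydration`, and `shouldInsertBlobRow` are hypothetical, the DID document shape is simplified, and skipping the insert for the `""` sentinel is an assumption based on `blob_cid` being `NOT NULL`.

```typescript
// Simplified, assumed shape of the DID document fields used here.
interface DidService {
  id: string;
  type: string;
  serviceEndpoint: string;
}

interface DidDocument {
  id: string;
  service?: DidService[];
}

// Hypothetical helper: return the PDS endpoint (without trailing slashes),
// or null when the document lists no matching service.
function extractPdsEndpoint(doc: DidDocument): string | null {
  for (const s of doc.service ?? []) {
    if (s.id === "#atproto_pds" && s.type === "AtprotoPersonalDataServer") {
      return s.serviceEndpoint.replace(/\/+$/, "");
    }
  }
  return null;
}

// Hypothetical helper: build the getBlob URL with encoded query parameters.
function buildGetBlobUrl(pdsEndpoint: string, did: string, cid: string): string {
  const query = `did=${encodeURIComponent(did)}&cid=${encodeURIComponent(cid)}`;
  return `${pdsEndpoint}/xrpc/com.atproto.sync.getBlob?${query}`;
}

// Stored CID state: null = never checked, "" = checked and absent,
// any other string = the CID.
type StoredCid = string | null;

// A stored profile needs re-hydration only if a CID was never checked.
function needsRehydration(avatarCid: StoredCid, bannerCid: StoredCid): boolean {
  return avatarCid === null || bannerCid === null;
}

// Insert a new profile_blobs row only when the CID differs from the latest row.
// Assumption: the "" sentinel means there is no blob, so nothing is inserted.
function shouldInsertBlobRow(latestCid: string | undefined, currentCid: string): boolean {
  if (currentCid === "") {
    return false;
  }
  return latestCid !== currentCid;
}
```

Keeping these functions free of I/O means they can be unit-tested with plain objects, while the service code only has to wire them to `fetch` and the repositories.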