A third-party ATProto appview

feat: Disable TypeScript backfill, use Python service

Co-authored-by: dollspacegay <dollspacegay@gmail.com>

+169
PYTHON_BACKFILL_GUIDE.md
# Python Backfill Guide

The AT Protocol backfill functionality is now **exclusively** handled by the Python implementation. TypeScript backfill has been permanently disabled.

## Quick Start

### 1. Basic Usage

```bash
# Backfill last 7 days
BACKFILL_DAYS=7 docker-compose -f docker-compose.python-default.yml up

# Backfill last 30 days
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up

# Backfill entire history (use with caution - very resource intensive)
BACKFILL_DAYS=-1 docker-compose -f docker-compose.python-default.yml up
```

### 2. Environment Variables

The Python backfill service is controlled entirely through environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `BACKFILL_DAYS` | `0` | Number of days to backfill (0 = disabled, -1 = total history) |
| `BACKFILL_BATCH_SIZE` | `5` | Events processed per batch |
| `BACKFILL_BATCH_DELAY_MS` | `2000` | Delay between batches (milliseconds) |
| `BACKFILL_MAX_CONCURRENT` | `2` | Maximum concurrent operations |
| `BACKFILL_MAX_MEMORY_MB` | `512` | Memory limit before throttling (MB) |
| `BACKFILL_USE_IDLE` | `true` | Use idle processing for better resource sharing |

### 3. Resource Profiles

#### Conservative (Default)
Best for production environments where backfill should not impact live operations:
```bash
# Uses the default conservative settings
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up
```

#### Moderate
Balanced performance for dedicated backfill windows:
```bash
BACKFILL_DAYS=30 \
BACKFILL_BATCH_SIZE=20 \
BACKFILL_BATCH_DELAY_MS=500 \
BACKFILL_MAX_CONCURRENT=5 \
BACKFILL_MAX_MEMORY_MB=1024 \
docker-compose -f docker-compose.python-default.yml up
```

#### Aggressive
Maximum speed for dedicated backfill servers:
```bash
BACKFILL_DAYS=30 \
BACKFILL_BATCH_SIZE=100 \
BACKFILL_BATCH_DELAY_MS=100 \
BACKFILL_MAX_CONCURRENT=10 \
BACKFILL_MAX_MEMORY_MB=2048 \
WORKER_MEMORY_LIMIT=8G \
docker-compose -f docker-compose.python-default.yml up
```

## How It Works

1. **Automatic Start**: When `BACKFILL_DAYS` is set to a non-zero value, the Python unified worker automatically starts the backfill service in the background.
2. **Primary Worker Only**: Backfill only runs on the primary worker (`WORKER_ID=0`) to avoid conflicts.
3. **Progress Tracking**: Progress is saved to the database every 1000 events, allowing the backfill to resume after an interruption.
4. **Memory Management**: The service monitors memory usage and throttles processing to stay within the configured limit (see the sketch below).
5. **Concurrent with Live Data**: Backfill runs alongside live firehose processing without interference.
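Concretely, the throttling and checkpointing described above can be pictured as a loop like the following. This is a minimal illustrative sketch, not the actual `python-firehose` code: `run_backfill`, `process_batch`, `save_cursor`, the `psutil`-based memory check, and the `event["seq"]` field are assumptions, and `BACKFILL_MAX_CONCURRENT` / `BACKFILL_USE_IDLE` are ignored for brevity. Only the environment variable names and the 1000-event checkpoint interval come from this guide.

```python
import asyncio
import os

import psutil  # assumed here for RSS measurement; the real service may measure memory differently

# Throttling knobs, read from the same environment variables as the table above.
BATCH_SIZE = int(os.environ.get("BACKFILL_BATCH_SIZE", "5"))
BATCH_DELAY_S = int(os.environ.get("BACKFILL_BATCH_DELAY_MS", "2000")) / 1000
MAX_MEMORY_MB = int(os.environ.get("BACKFILL_MAX_MEMORY_MB", "512"))
CHECKPOINT_EVERY = 1000  # progress is persisted every 1000 events


def rss_mb() -> float:
    """Resident set size of this process, in megabytes."""
    return psutil.Process().memory_info().rss / (1024 * 1024)


async def run_backfill(events, process_batch, save_cursor):
    """Hypothetical throttled loop: process events in small batches, pause between
    batches, slow down when memory is high, and checkpoint progress so an
    interrupted run can resume from the last saved position."""
    batch, since_checkpoint = [], 0

    async for event in events:  # historical events, oldest first (shape is assumed)
        batch.append(event)
        if len(batch) < BATCH_SIZE:
            continue

        await process_batch(batch)               # write the batch to Postgres
        since_checkpoint += len(batch)

        if since_checkpoint >= CHECKPOINT_EVERY:
            await save_cursor(batch[-1]["seq"])  # enables resume after interruption
            since_checkpoint = 0

        batch.clear()

        # Throttle: fixed delay between batches, doubled while over the memory
        # limit to let in-flight work drain before pulling more history.
        delay = BATCH_DELAY_S
        if rss_mb() > MAX_MEMORY_MB:
            delay *= 2
        await asyncio.sleep(delay)

    if batch:                                    # flush any trailing partial batch
        await process_batch(batch)
        await save_cursor(batch[-1]["seq"])
```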
## Monitoring

### Check Progress
```bash
# View backfill logs
docker-compose logs python-worker | grep BACKFILL

# Check database progress
docker-compose exec db psql -U postgres -d atproto -c "SELECT * FROM firehose_cursor WHERE service = 'backfill';"
```

### Example Log Output
```
[BACKFILL] Starting 30-day historical backfill on primary worker...
[BACKFILL] Resource throttling config:
  - Batch size: 5 events
  - Batch delay: 2000ms
  - Max concurrent: 2
  - Memory limit: 512MB
  - Idle processing: True
[BACKFILL] Progress: 10000 received, 9500 processed, 500 skipped (250 evt/s)
[BACKFILL] Memory: 245MB / 512MB limit
```

## Troubleshooting

### Backfill Not Starting

1. Check if `BACKFILL_DAYS` is set:
   ```bash
   docker-compose exec python-worker env | grep BACKFILL
   ```

2. Verify you're on the primary worker:
   ```bash
   docker-compose exec python-worker env | grep WORKER_ID
   ```

3. Check logs for errors:
   ```bash
   docker-compose logs python-worker | grep -E "BACKFILL|ERROR"
   ```

### Performance Issues

1. **High Memory Usage**: Reduce `BACKFILL_BATCH_SIZE` and `BACKFILL_MAX_CONCURRENT`
2. **Slow Progress**: Increase batch size and reduce delay for faster processing
3. **Database Overload**: Reduce concurrent operations and increase delays

### Resume After Interruption

The backfill automatically resumes from the last saved position. Just restart with the same `BACKFILL_DAYS` value:

```bash
# Original run (interrupted)
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up

# Resume (will continue from last position)
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up
```

## TypeScript Backfill Status

The TypeScript backfill has been permanently disabled:

- ✅ Code removed from `server/index.ts`
- ✅ API endpoints return 501 Not Implemented
- ✅ Environment variable forced to 0 in all docker-compose files
- ✅ All backfill functionality moved to Python

## Migration from TypeScript

If you were previously using TypeScript backfill:

1. Stop all services
2. Use the new Python-based docker-compose file
3. Set `BACKFILL_DAYS` as needed
4. Start services - Python backfill will handle everything

## Best Practices

1. **Test First**: Start with a small number of days (e.g., 1-7) to test your setup
2. **Monitor Resources**: Watch memory and CPU usage during initial runs
3. **Off-Peak Hours**: Run large backfills during low-traffic periods
4. **Incremental Approach**: For large histories, consider multiple smaller backfills
5. **Database Maintenance**: Run `VACUUM` and `ANALYZE` after large backfills

## Support

For issues or questions:
1. Check logs: `docker-compose logs python-worker`
2. Review this guide
3. Check `python-firehose/README.backfill.md` for technical details
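For reference, the resume behaviour described above boils down to persisting a cursor row and reading it back on start-up. Below is a minimal sketch with `asyncpg`: the `firehose_cursor` table and its `service` column appear in the Monitoring query above, but the `last_seq` column name and the upsert shape are assumptions rather than the actual schema used by `python-firehose`.

```python
"""Minimal sketch of cursor persistence for resume support (schema partly assumed)."""
import asyncpg


async def save_cursor(pool: asyncpg.Pool, seq: int) -> None:
    # Upsert the last processed sequence number for the backfill service
    # (assumes a unique constraint on the service column).
    await pool.execute(
        """
        INSERT INTO firehose_cursor (service, last_seq)
        VALUES ('backfill', $1)
        ON CONFLICT (service) DO UPDATE SET last_seq = EXCLUDED.last_seq
        """,
        seq,
    )


async def load_cursor(pool: asyncpg.Pool) -> int | None:
    # Returns the last saved position, or None on a fresh run.
    return await pool.fetchval(
        "SELECT last_seq FROM firehose_cursor WHERE service = 'backfill'"
    )
```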
+113
docker-compose.python-default.yml
```yaml
# Default Docker Compose - Python Unified Worker with Backfill
# This is the recommended configuration that uses Python for all firehose processing
# TypeScript backfill is permanently disabled

services:
  db:
    image: postgres:14
    command: postgres -c max_connections=500 -c shared_buffers=20GB -c effective_cache_size=42GB -c work_mem=256MB -c maintenance_work_mem=8GB -c max_parallel_workers=32 -c max_parallel_workers_per_gather=8 -c max_wal_size=8GB
    environment:
      POSTGRES_DB: atproto
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data:Z
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d atproto"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    shm_size: 10gb

  # Python Unified Worker - Handles both firehose and backfill
  # This is the PRIMARY worker that processes all AT Protocol events
  python-worker:
    build:
      context: ./python-firehose
      dockerfile: Dockerfile.unified
    environment:
      # Core configuration
      - RELAY_URL=${RELAY_URL:-wss://bsky.network}
      - DATABASE_URL=postgresql://postgres:password@db:5432/atproto
      - DB_POOL_SIZE=${DB_POOL_SIZE:-20}
      - LOG_LEVEL=${LOG_LEVEL:-INFO}

      # BACKFILL CONFIGURATION
      # Set BACKFILL_DAYS to enable historical data import:
      #   -  0 = disabled (default)
      #   - -1 = total history (entire available rollback window)
      #   -  N = specific number of days (e.g., 7, 30, 365)
      - BACKFILL_DAYS=${BACKFILL_DAYS:-0}

      # Backfill resource throttling (optional - defaults are conservative)
      - BACKFILL_BATCH_SIZE=${BACKFILL_BATCH_SIZE:-5}
      - BACKFILL_BATCH_DELAY_MS=${BACKFILL_BATCH_DELAY_MS:-2000}
      - BACKFILL_MAX_CONCURRENT=${BACKFILL_MAX_CONCURRENT:-2}
      - BACKFILL_MAX_MEMORY_MB=${BACKFILL_MAX_MEMORY_MB:-512}
      - BACKFILL_USE_IDLE=${BACKFILL_USE_IDLE:-true}

      # Worker identification
      - WORKER_ID=0  # Primary worker - runs backfill
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "python -c \"import asyncio, asyncpg; asyncio.run(asyncpg.connect('postgresql://postgres:password@db:5432/atproto', timeout=5))\" || exit 1"]
      interval: 30s
      timeout: 10s
      start_period: 40s
      retries: 3
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: ${WORKER_MEMORY_LIMIT:-4G}
        reservations:
          memory: 1G

  # Frontend/API server - TypeScript (backfill disabled)
  app:
    volumes:
      - ./appview-signing-key.json:/app/appview-signing-key.json:ro,Z
      - ./appview-private.pem:/app/appview-private.pem:ro,Z
      - ./public/did.json:/app/public/did.json:ro,Z
      - ./oauth-keyset.json:/app/oauth-keyset.json:ro,Z
    build: .
    ports:
      - "5000:5000"
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/atproto
      - SESSION_SECRET=${SESSION_SECRET:-change-this-to-a-random-secret-in-production}
      - APPVIEW_DID=${APPVIEW_DID:-did:web:appview.dollspace.gay}
      - DATA_RETENTION_DAYS=${DATA_RETENTION_DAYS:-0}
      - DB_POOL_SIZE=50
      - PORT=5000
      - NODE_ENV=production
      - OAUTH_KEYSET_PATH=/app/oauth-keyset.json
      - ADMIN_DIDS=${ADMIN_DIDS:-did:plc:abc123xyz,admin.bsky.social,did:plc:def456uvw}

      # TypeScript firehose and backfill are PERMANENTLY DISABLED
      - FIREHOSE_ENABLED=false
      - BACKFILL_DAYS=0  # Force disabled - Python handles all backfill
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "node -e \"require('http').get('http://localhost:5000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})\""]
      interval: 30s
      timeout: 10s
      start_period: 40s
      retries: 3
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M

volumes:
  postgres_data:
```
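The comments in this compose file describe when backfill actually starts: only when `BACKFILL_DAYS` is non-zero, and only on the primary worker (`WORKER_ID=0`), running in the background alongside the live firehose. A minimal sketch of that start-up gating, with hypothetical `start_firehose` / `start_backfill` placeholders rather than the real entry points in `python-firehose`:

```python
import asyncio
import os


async def start_firehose():
    """Placeholder for the live firehose consumer (always runs)."""
    ...


async def start_backfill(days: int):
    """Placeholder for the historical backfill service."""
    ...


def backfill_days() -> int:
    """Interpret BACKFILL_DAYS as documented in the compose file:
    0 = disabled (default), -1 = total history, N = last N days."""
    return int(os.environ.get("BACKFILL_DAYS", "0"))


async def main():
    worker_id = int(os.environ.get("WORKER_ID", "0"))
    days = backfill_days()

    tasks = [asyncio.create_task(start_firehose())]

    # Backfill is started in the background, and only on the primary worker
    # (WORKER_ID=0), so multiple workers never replay the same history.
    if days != 0 and worker_id == 0:
        tasks.append(asyncio.create_task(start_backfill(days)))

    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
```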
+59
migrate-to-python-backfill.sh
```bash
#!/bin/bash
# Migration script from TypeScript to Python backfill

echo "=========================================="
echo "Migrating to Python Backfill Service"
echo "=========================================="
echo ""
echo "This script will help you migrate from the TypeScript backfill"
echo "to the new Python backfill service."
echo ""

# Check if docker-compose services are running
if docker-compose ps | grep -q "Up"; then
    echo "⚠️  WARNING: Docker services are currently running."
    echo "Please stop them first with: docker-compose down"
    exit 1
fi

# Create backup of current docker-compose.yml
if [ -f docker-compose.yml ]; then
    echo "📁 Backing up current docker-compose.yml to docker-compose.yml.backup"
    cp docker-compose.yml docker-compose.yml.backup
fi

# Use the Python default configuration
echo "📝 Setting up Python-based configuration..."
cp docker-compose.python-default.yml docker-compose.yml

echo ""
echo "✅ Migration complete!"
echo ""
echo "=========================================="
echo "How to use Python backfill:"
echo "=========================================="
echo ""
echo "1. Start with backfill disabled (default):"
echo "   docker-compose up"
echo ""
echo "2. Enable backfill for specific days:"
echo "   BACKFILL_DAYS=7 docker-compose up"
echo ""
echo "3. Enable total history backfill:"
echo "   BACKFILL_DAYS=-1 docker-compose up"
echo ""
echo "4. Customize resources (example):"
echo "   BACKFILL_DAYS=30 \\"
echo "   BACKFILL_BATCH_SIZE=20 \\"
echo "   BACKFILL_MAX_MEMORY_MB=1024 \\"
echo "   docker-compose up"
echo ""
echo "=========================================="
echo "Important changes:"
echo "=========================================="
echo "✅ TypeScript backfill is permanently disabled"
echo "✅ Python worker handles all backfill operations"
echo "✅ Same BACKFILL_DAYS environment variable"
echo "✅ Progress saved to database for resume support"
echo ""
echo "For more information, see PYTHON_BACKFILL_GUIDE.md"
```
+12 -73
server/routes.ts
```diff
···
   }
 });

+// TypeScript backfill is PERMANENTLY DISABLED
+// Use the Python backfill service instead: python-firehose/backfill_service.py
 app.post("/api/user/backfill", csrfProtection.validateToken, requireAuth, async (req: AuthRequest, res) => {
-  try {
-    if (!req.session) {
-      return res.status(401).json({ error: "Not authenticated" });
-    }
-
-    const session = await storage.getSession(req.session.sessionId);
-    if (!session) {
-      return res.status(404).json({ error: "Session not found" });
-    }
-
-    const schema = z.object({
-      days: z.number().min(0).max(3650), // 0 = all data, max 10 years
-    });
-
-    const data = schema.parse(req.body);
-    const userDid = session.userDid;
-
-    if (data.days === 0 || data.days > 3) {
-      const { repoBackfillService } = await import("./services/repo-backfill");
-
-      repoBackfillService.backfillSingleRepo(userDid, data.days).then(() => {
-        console.log(`[USER_BACKFILL] Completed repository backfill for ${userDid}`);
-      }).catch((error: Error) => {
-        console.error(`[USER_BACKFILL] Failed repository backfill for ${userDid}:`, error);
-      });
-
-      const message = data.days === 0
-        ? `Backfill started for ALL your data. Your complete repository is being imported from your PDS.`
-        : `Backfill started for the last ${data.days} days. Your repository is being imported from your PDS.`;
-
-      res.json({
-        message,
-        type: "repository"
-      });
-    } else {
-      res.json({
-        message: `Recent data backfill (${data.days} days) will be handled by the firehose.`,
-        type: "firehose"
-      });
-    }
-
-    await storage.updateUserSettings(userDid, {
-      lastBackfillAt: new Date(),
-    });
-  } catch (error) {
-    console.error("[USER_BACKFILL] Error:", error);
-    res.status(400).json({ error: error instanceof Error ? error.message : "Failed to start backfill" });
-  }
+  res.status(501).json({
+    error: "TypeScript backfill has been disabled. Please use the Python backfill service instead.",
+    info: "Set the BACKFILL_DAYS environment variable and run the Python unified worker."
+  });
 });

 app.post("/api/user/delete-data", deletionLimiter, csrfProtection.validateToken, requireAuth, async (req: AuthRequest, res) => {
···
   }
 });

-// Backfill test endpoint - backfill a single repository
+// TypeScript backfill test endpoint is PERMANENTLY DISABLED
+// Use the Python backfill service instead
 app.post("/api/backfill/repo", async (req, res) => {
-  try {
-    const schema = z.object({
-      did: z.string(),
-    });
-
-    const data = schema.parse(req.body);
-    const { repoBackfillService } = await import("./services/repo-backfill");
-
-    console.log(`[API] Starting repo backfill for ${data.did}...`);
-    // Skip date check for test endpoint to allow testing even when BACKFILL_DAYS=0
-    await repoBackfillService.backfillSingleRepo(data.did);
-
-    const progress = repoBackfillService.getProgress();
-    res.json({
-      success: true,
-      did: data.did,
-      progress
-    });
-  } catch (error) {
-    console.error("[API] Repo backfill error:", error);
-    res.status(500).json({
-      error: error instanceof Error ? error.message : "Failed to backfill repo"
-    });
-  }
+  res.status(501).json({
+    error: "TypeScript backfill has been disabled. Please use the Python backfill service instead.",
+    info: "Set the BACKFILL_DAYS environment variable and run the Python unified worker."
+  });
 });

 // XRPC API Endpoints
```
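With this change, the old endpoints answer with 501 Not Implemented instead of starting a backfill. A quick way to verify the behaviour: the `/api/backfill/repo` path and the JSON error body come from the handler above, while `localhost:5000` and the placeholder DID are assumptions based on `docker-compose.python-default.yml`.

```python
"""Quick check that the disabled TypeScript backfill endpoint now returns 501."""
import json
import urllib.error
import urllib.request

# /api/backfill/repo is the unauthenticated test endpoint shown in the diff above;
# the app service from docker-compose.python-default.yml listens on port 5000.
req = urllib.request.Request(
    "http://localhost:5000/api/backfill/repo",
    data=json.dumps({"did": "did:plc:abc123xyz"}).encode(),  # placeholder DID
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as exc:
    # Expected: 501 plus the JSON error body pointing at the Python service.
    print(exc.code, exc.read().decode())
```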