+169
PYTHON_BACKFILL_GUIDE.md
···
# Python Backfill Guide

The AT Protocol backfill functionality is now **exclusively** handled by the Python implementation. TypeScript backfill has been permanently disabled.

## Quick Start

### 1. Basic Usage

```bash
# Backfill the last 7 days
BACKFILL_DAYS=7 docker-compose -f docker-compose.python-default.yml up

# Backfill the last 30 days
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up

# Backfill the entire history (use with caution - very resource intensive)
BACKFILL_DAYS=-1 docker-compose -f docker-compose.python-default.yml up
```

### 2. Environment Variables

The Python backfill service is controlled entirely through environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `BACKFILL_DAYS` | `0` | Number of days to backfill (`0` = disabled, `-1` = entire history) |
| `BACKFILL_BATCH_SIZE` | `5` | Events processed per batch |
| `BACKFILL_BATCH_DELAY_MS` | `2000` | Delay between batches (milliseconds) |
| `BACKFILL_MAX_CONCURRENT` | `2` | Maximum concurrent operations |
| `BACKFILL_MAX_MEMORY_MB` | `512` | Memory limit before throttling (MB) |
| `BACKFILL_USE_IDLE` | `true` | Use idle processing for better resource sharing |
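
As a rough sketch of how a worker might map these variables onto a typed configuration: the variable names and defaults mirror the table above, but `BackfillConfig` itself is a hypothetical illustration, not the actual class in `python-firehose`.

```python
import os
from dataclasses import dataclass


@dataclass
class BackfillConfig:
    """Hypothetical config object mirroring the table above."""
    days: int            # 0 = disabled, -1 = entire history
    batch_size: int
    batch_delay_ms: int
    max_concurrent: int
    max_memory_mb: int
    use_idle: bool

    @classmethod
    def from_env(cls) -> "BackfillConfig":
        # Defaults match the table above
        return cls(
            days=int(os.getenv("BACKFILL_DAYS", "0")),
            batch_size=int(os.getenv("BACKFILL_BATCH_SIZE", "5")),
            batch_delay_ms=int(os.getenv("BACKFILL_BATCH_DELAY_MS", "2000")),
            max_concurrent=int(os.getenv("BACKFILL_MAX_CONCURRENT", "2")),
            max_memory_mb=int(os.getenv("BACKFILL_MAX_MEMORY_MB", "512")),
            use_idle=os.getenv("BACKFILL_USE_IDLE", "true").lower() == "true",
        )

    @property
    def enabled(self) -> bool:
        return self.days != 0
```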

### 3. Resource Profiles

#### Conservative (Default)
Best for production environments where backfill should not impact live operations:
```bash
# Uses the default conservative settings
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up
```

#### Moderate
Balanced performance for dedicated backfill windows:
```bash
BACKFILL_DAYS=30 \
BACKFILL_BATCH_SIZE=20 \
BACKFILL_BATCH_DELAY_MS=500 \
BACKFILL_MAX_CONCURRENT=5 \
BACKFILL_MAX_MEMORY_MB=1024 \
docker-compose -f docker-compose.python-default.yml up
```

#### Aggressive
Maximum speed for dedicated backfill servers:
```bash
BACKFILL_DAYS=30 \
BACKFILL_BATCH_SIZE=100 \
BACKFILL_BATCH_DELAY_MS=100 \
BACKFILL_MAX_CONCURRENT=10 \
BACKFILL_MAX_MEMORY_MB=2048 \
WORKER_MEMORY_LIMIT=8G \
docker-compose -f docker-compose.python-default.yml up
```
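
Because each batch waits `BACKFILL_BATCH_DELAY_MS` before the next one starts, batch size and delay put a hard ceiling on throughput, which is a quick way to sanity-check a profile before running it. This is only an upper bound; real rates will be lower once processing time is included.

```python
def max_events_per_second(batch_size: int, batch_delay_ms: int) -> float:
    """Upper bound on throughput: at most one batch per delay interval."""
    return batch_size * 1000.0 / batch_delay_ms


# Conservative: 5 events every 2000 ms -> 2.5 evt/s ceiling
print(max_events_per_second(5, 2000))    # 2.5
# Moderate: 20 events every 500 ms -> 40 evt/s ceiling
print(max_events_per_second(20, 500))    # 40.0
# Aggressive: 100 events every 100 ms -> 1000 evt/s ceiling
print(max_events_per_second(100, 100))   # 1000.0
```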

## How It Works

1. **Automatic Start**: When `BACKFILL_DAYS` is set to a non-zero value, the Python unified worker automatically starts the backfill service in the background.

2. **Primary Worker Only**: Backfill runs only on the primary worker (`WORKER_ID=0`) to avoid conflicts.

3. **Progress Tracking**: Progress is saved to the database every 1000 events, allowing the backfill to resume after an interruption.

4. **Memory Management**: The service monitors memory usage and throttles processing to stay within the configured limit.

5. **Concurrent with Live Data**: Backfill runs alongside live firehose processing without interfering with it.
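
The batching, pacing, throttling, and checkpointing described above can be sketched as a single loop. This is a simplified illustration, not the actual service code: `process_batch`, `memory_mb`, and `checkpoint` are hypothetical callables standing in for the real database and monitoring hooks, and the every-1000-events checkpoint assumes the batch size divides 1000 evenly.

```python
import time
from typing import Callable, Iterable, List


def run_backfill(events: Iterable[dict],
                 process_batch: Callable[[List[dict]], None],
                 batch_size: int = 5,
                 batch_delay_s: float = 2.0,
                 memory_mb: Callable[[], int] = lambda: 0,
                 max_memory_mb: int = 512,
                 checkpoint: Callable[[int], None] = lambda n: None) -> int:
    """Sketch of the loop: batch, throttle on memory, checkpoint, pace."""
    processed = 0
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) < batch_size:
            continue
        while memory_mb() > max_memory_mb:
            time.sleep(0.1)              # throttle until memory drops
        process_batch(batch)
        processed += len(batch)
        batch = []
        if processed % 1000 == 0:
            checkpoint(processed)        # save resume position
        time.sleep(batch_delay_s)        # pace the backfill
    if batch:                            # flush the final partial batch
        process_batch(batch)
        processed += len(batch)
    return processed
```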

## Monitoring

### Check Progress
```bash
# View backfill logs
docker-compose logs python-worker | grep BACKFILL

# Check database progress
docker-compose exec db psql -U postgres -d atproto -c "SELECT * FROM firehose_cursor WHERE service = 'backfill';"
```

### Example Log Output
```
[BACKFILL] Starting 30-day historical backfill on primary worker...
[BACKFILL] Resource throttling config:
  - Batch size: 5 events
  - Batch delay: 2000ms
  - Max concurrent: 2
  - Memory limit: 512MB
  - Idle processing: True
[BACKFILL] Progress: 10000 received, 9500 processed, 500 skipped (250 evt/s)
[BACKFILL] Memory: 245MB / 512MB limit
```

## Troubleshooting

### Backfill Not Starting

1. Check that `BACKFILL_DAYS` is set:
```bash
docker-compose exec python-worker env | grep BACKFILL
```

2. Verify that you are on the primary worker:
```bash
docker-compose exec python-worker env | grep WORKER_ID
```

3. Check the logs for errors:
```bash
docker-compose logs python-worker | grep -E "BACKFILL|ERROR"
```

### Performance Issues

1. **High memory usage**: Reduce `BACKFILL_BATCH_SIZE` and `BACKFILL_MAX_CONCURRENT`
2. **Slow progress**: Increase the batch size and reduce the delay for faster processing
3. **Database overload**: Reduce concurrent operations and increase the delays

### Resume After Interruption

The backfill automatically resumes from the last saved position. Just restart with the same `BACKFILL_DAYS` value:

```bash
# Original run (interrupted)
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up

# Resume (continues from the last saved position)
BACKFILL_DAYS=30 docker-compose -f docker-compose.python-default.yml up
```

## TypeScript Backfill Status

The TypeScript backfill has been permanently disabled:

- ✅ Code removed from `server/index.ts`
- ✅ API endpoints return 501 Not Implemented
- ✅ Environment variable forced to `0` in all docker-compose files
- ✅ All backfill functionality moved to Python

## Migration from TypeScript

If you were previously using the TypeScript backfill:

1. Stop all services
2. Switch to the new Python-based docker-compose file
3. Set `BACKFILL_DAYS` as needed
4. Start the services - the Python backfill handles everything

## Best Practices

1. **Test First**: Start with a small number of days (e.g., 1-7) to verify your setup
2. **Monitor Resources**: Watch memory and CPU usage during the initial runs
3. **Off-Peak Hours**: Run large backfills during low-traffic periods
4. **Incremental Approach**: For large histories, consider several smaller backfills
5. **Database Maintenance**: Run `VACUUM` and `ANALYZE` after large backfills

## Support

For issues or questions:
1. Check the logs: `docker-compose logs python-worker`
2. Review this guide
3. See `python-firehose/README.backfill.md` for technical details
+113
docker-compose.python-default.yml
···
# Default Docker Compose - Python Unified Worker with Backfill
# This is the recommended configuration that uses Python for all firehose processing
# TypeScript backfill is permanently disabled

services:
  db:
    image: postgres:14
    command: postgres -c max_connections=500 -c shared_buffers=20GB -c effective_cache_size=42GB -c work_mem=256MB -c maintenance_work_mem=8GB -c max_parallel_workers=32 -c max_parallel_workers_per_gather=8 -c max_wal_size=8GB
    environment:
      POSTGRES_DB: atproto
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data:Z
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d atproto"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
    shm_size: 10gb

  # Python Unified Worker - handles both firehose and backfill
  # This is the PRIMARY worker that processes all AT Protocol events
  python-worker:
    build:
      context: ./python-firehose
      dockerfile: Dockerfile.unified
    environment:
      # Core configuration
      - RELAY_URL=${RELAY_URL:-wss://bsky.network}
      - DATABASE_URL=postgresql://postgres:password@db:5432/atproto
      - DB_POOL_SIZE=${DB_POOL_SIZE:-20}
      - LOG_LEVEL=${LOG_LEVEL:-INFO}

      # BACKFILL CONFIGURATION
      # Set BACKFILL_DAYS to enable historical data import:
      #    0 = disabled (default)
      #   -1 = entire history (the whole available rollback window)
      #    N = specific number of days (e.g., 7, 30, 365)
      - BACKFILL_DAYS=${BACKFILL_DAYS:-0}

      # Backfill resource throttling (optional - defaults are conservative)
      - BACKFILL_BATCH_SIZE=${BACKFILL_BATCH_SIZE:-5}
      - BACKFILL_BATCH_DELAY_MS=${BACKFILL_BATCH_DELAY_MS:-2000}
      - BACKFILL_MAX_CONCURRENT=${BACKFILL_MAX_CONCURRENT:-2}
      - BACKFILL_MAX_MEMORY_MB=${BACKFILL_MAX_MEMORY_MB:-512}
      - BACKFILL_USE_IDLE=${BACKFILL_USE_IDLE:-true}

      # Worker identification
      - WORKER_ID=0  # Primary worker - runs backfill
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "python -c \"import asyncpg; import asyncio; asyncio.run(asyncpg.connect('postgresql://postgres:password@db:5432/atproto', timeout=5).close())\" || exit 1"]
      interval: 30s
      timeout: 10s
      start_period: 40s
      retries: 3
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: ${WORKER_MEMORY_LIMIT:-4G}
        reservations:
          memory: 1G

  # Frontend/API server - TypeScript (backfill disabled)
  app:
    volumes:
      - ./appview-signing-key.json:/app/appview-signing-key.json:ro,Z
      - ./appview-private.pem:/app/appview-private.pem:ro,Z
      - ./public/did.json:/app/public/did.json:ro,Z
      - ./oauth-keyset.json:/app/oauth-keyset.json:ro,Z
    build: .
    ports:
      - "5000:5000"
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/atproto
      - SESSION_SECRET=${SESSION_SECRET:-change-this-to-a-random-secret-in-production}
      - APPVIEW_DID=${APPVIEW_DID:-did:web:appview.dollspace.gay}
      - DATA_RETENTION_DAYS=${DATA_RETENTION_DAYS:-0}
      - DB_POOL_SIZE=50
      - PORT=5000
      - NODE_ENV=production
      - OAUTH_KEYSET_PATH=/app/oauth-keyset.json
      - ADMIN_DIDS=${ADMIN_DIDS:-did:plc:abc123xyz,admin.bsky.social,did:plc:def456uvw}

      # TypeScript firehose and backfill are PERMANENTLY DISABLED
      - FIREHOSE_ENABLED=false
      - BACKFILL_DAYS=0  # Force disabled - Python handles all backfill
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "node -e \"require('http').get('http://localhost:5000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})\""]
      interval: 30s
      timeout: 10s
      start_period: 40s
      retries: 3
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M

volumes:
  postgres_data:
+59
migrate-to-python-backfill.sh
···
#!/bin/bash
# Migration script from TypeScript to Python backfill

echo "=========================================="
echo "Migrating to Python Backfill Service"
echo "=========================================="
echo ""
echo "This script will help you migrate from the TypeScript backfill"
echo "to the new Python backfill service."
echo ""

# Abort if docker-compose services are still running
if docker-compose ps | grep -q "Up"; then
  echo "⚠️  WARNING: Docker services are currently running."
  echo "Please stop them first with: docker-compose down"
  exit 1
fi

# Back up the current docker-compose.yml
if [ -f docker-compose.yml ]; then
  echo "📁 Backing up current docker-compose.yml to docker-compose.yml.backup"
  cp docker-compose.yml docker-compose.yml.backup
fi

# Use the Python default configuration
echo "📝 Setting up Python-based configuration..."
cp docker-compose.python-default.yml docker-compose.yml

echo ""
echo "✅ Migration complete!"
echo ""
echo "=========================================="
echo "How to use Python backfill:"
echo "=========================================="
echo ""
echo "1. Start with backfill disabled (default):"
echo "   docker-compose up"
echo ""
echo "2. Enable backfill for a specific number of days:"
echo "   BACKFILL_DAYS=7 docker-compose up"
echo ""
echo "3. Enable total history backfill:"
echo "   BACKFILL_DAYS=-1 docker-compose up"
echo ""
echo "4. Customize resources (example):"
echo "   BACKFILL_DAYS=30 \\"
echo "   BACKFILL_BATCH_SIZE=20 \\"
echo "   BACKFILL_MAX_MEMORY_MB=1024 \\"
echo "   docker-compose up"
echo ""
echo "=========================================="
echo "Important changes:"
echo "=========================================="
echo "✅ TypeScript backfill is permanently disabled"
echo "✅ Python worker handles all backfill operations"
echo "✅ Same BACKFILL_DAYS environment variable"
echo "✅ Progress saved to database for resume support"
echo ""
echo "For more information, see PYTHON_BACKFILL_GUIDE.md"
+12
-73
server/routes.ts
···
   }
 });

+// TypeScript backfill is PERMANENTLY DISABLED
+// Use the Python backfill service instead: python-firehose/backfill_service.py
 app.post("/api/user/backfill", csrfProtection.validateToken, requireAuth, async (req: AuthRequest, res) => {
-  try {
-    if (!req.session) {
-      return res.status(401).json({ error: "Not authenticated" });
-    }
-
-    const session = await storage.getSession(req.session.sessionId);
-    if (!session) {
-      return res.status(404).json({ error: "Session not found" });
-    }
-
-    const schema = z.object({
-      days: z.number().min(0).max(3650), // 0 = all data, max 10 years
-    });
-
-    const data = schema.parse(req.body);
-    const userDid = session.userDid;
-
-    if (data.days === 0 || data.days > 3) {
-      const { repoBackfillService } = await import("./services/repo-backfill");
-
-      repoBackfillService.backfillSingleRepo(userDid, data.days).then(() => {
-        console.log(`[USER_BACKFILL] Completed repository backfill for ${userDid}`);
-      }).catch((error: Error) => {
-        console.error(`[USER_BACKFILL] Failed repository backfill for ${userDid}:`, error);
-      });
-
-      const message = data.days === 0
-        ? `Backfill started for ALL your data. Your complete repository is being imported from your PDS.`
-        : `Backfill started for the last ${data.days} days. Your repository is being imported from your PDS.`;
-
-      res.json({
-        message,
-        type: "repository"
-      });
-    } else {
-      res.json({
-        message: `Recent data backfill (${data.days} days) will be handled by the firehose.`,
-        type: "firehose"
-      });
-    }
-
-    await storage.updateUserSettings(userDid, {
-      lastBackfillAt: new Date(),
-    });
-  } catch (error) {
-    console.error("[USER_BACKFILL] Error:", error);
-    res.status(400).json({ error: error instanceof Error ? error.message : "Failed to start backfill" });
-  }
+  res.status(501).json({
+    error: "TypeScript backfill has been disabled. Please use the Python backfill service instead.",
+    info: "Set BACKFILL_DAYS environment variable and run the Python unified worker."
+  });
 });

 app.post("/api/user/delete-data", deletionLimiter, csrfProtection.validateToken, requireAuth, async (req: AuthRequest, res) => {
···
   }
 });

-// Backfill test endpoint - backfill a single repository
+// TypeScript backfill test endpoint is PERMANENTLY DISABLED
+// Use the Python backfill service instead
 app.post("/api/backfill/repo", async (req, res) => {
-  try {
-    const schema = z.object({
-      did: z.string(),
-    });
-
-    const data = schema.parse(req.body);
-    const { repoBackfillService } = await import("./services/repo-backfill");
-
-    console.log(`[API] Starting repo backfill for ${data.did}...`);
-    // Skip date check for test endpoint to allow testing even when BACKFILL_DAYS=0
-    await repoBackfillService.backfillSingleRepo(data.did);
-
-    const progress = repoBackfillService.getProgress();
-    res.json({
-      success: true,
-      did: data.did,
-      progress
-    });
-  } catch (error) {
-    console.error("[API] Repo backfill error:", error);
-    res.status(500).json({
-      error: error instanceof Error ? error.message : "Failed to backfill repo"
-    });
-  }
+  res.status(501).json({
+    error: "TypeScript backfill has been disabled. Please use the Python backfill service instead.",
+    info: "Set BACKFILL_DAYS environment variable and run the Python unified worker."
+  });
 });

 // XRPC API Endpoints