fix: Address critical production issues from code audit
## Critical Issues Fixed
1. **Fix panic risk in rate limiter**
- Removed the panicking unwrap() in RateLimiter::new()
- Constructor now returns a Result so callers can handle the error
- Prevents service crashes caused by an invalid rate limit config
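
A minimal sketch of the fallible constructor pattern described above. The error type `RateLimitConfigError`, the `per_second` field, and the zero-rate check are illustrative assumptions, not the actual code in this commit:

```rust
use std::fmt;
use std::num::NonZeroU32;

/// Hypothetical error type; the real crate's error may look different.
#[derive(Debug, PartialEq)]
pub enum RateLimitConfigError {
    ZeroRate,
}

impl fmt::Display for RateLimitConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RateLimitConfigError::ZeroRate => {
                write!(f, "rate limit must be greater than zero")
            }
        }
    }
}

impl std::error::Error for RateLimitConfigError {}

/// Hypothetical limiter holding only the validated rate.
#[derive(Debug)]
pub struct RateLimiter {
    per_second: NonZeroU32,
}

impl RateLimiter {
    /// Validates the configured rate and returns an error instead of
    /// panicking via unwrap() when the value is invalid.
    pub fn new(per_second: u32) -> Result<Self, RateLimitConfigError> {
        let per_second =
            NonZeroU32::new(per_second).ok_or(RateLimitConfigError::ZeroRate)?;
        Ok(Self { per_second })
    }

    pub fn per_second(&self) -> u32 {
        self.per_second.get()
    }
}
```

With this shape, a caller at startup can log the error and exit cleanly, or fall back to a default, rather than crashing on a bad config value.
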
2. **Fix Redis connection leak in worker pool**
- Worker was creating a new Redis client for every job
- Refactored to create the connection once at worker startup
- The connection is reused for all jobs processed by that worker
- Eliminates per-job connection setup overhead
3. **Add configurable blob download timeouts**
- Added per-attempt timeout (default: 10s per format/endpoint)
- Added total timeout across all fallbacks (default: 30s)
- Prevents 90s+ hangs when the CDN is slow or unavailable
- Configurable via BLOB_DOWNLOAD_TIMEOUT_SECS and BLOB_TOTAL_TIMEOUT_SECS
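
One way the per-attempt and total timeouts might interact, sketched with the standard library only. The helper name `next_attempt_timeout` and the rule of clamping each attempt to the remaining total budget are assumptions for illustration; the actual download code is not shown here:

```rust
use std::time::{Duration, Instant};

/// Returns the timeout to use for the next download attempt, or `None`
/// once the total budget across all fallbacks has been used up.
///
/// Each attempt gets at most `per_attempt`, shortened if less than that
/// remains of the `total` budget.
pub fn next_attempt_timeout(
    started: Instant,
    now: Instant,
    per_attempt: Duration,
    total: Duration,
) -> Option<Duration> {
    let elapsed = now.saturating_duration_since(started);
    // `checked_sub` yields None when elapsed already exceeds the budget.
    let remaining = total.checked_sub(elapsed)?;
    if remaining.is_zero() {
        return None;
    }
    Some(per_attempt.min(remaining))
}
```

With the defaults of 10s and 30s, an attempt starting 25s into the download would get only 5s, so the fallback chain as a whole cannot run past the total timeout.
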
4. **Add retry logic for Ozone API calls**
- Implements exponential backoff (100ms → 5s cap)
- Retries up to 3 times on transient errors (5xx, timeouts)
- Detects non-transient errors (4xx) and fails immediately
- Prevents data loss due to temporary API unavailability
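
The retry behaviour above can be sketched roughly as follows. The `ApiError` enum, the `with_retries` helper, the doubling factor, and the injected `sleep` callback are all hypothetical; this sketch also reads "up to 3 retries" as at most 4 attempts in total, which may not match the real implementation:

```rust
use std::time::Duration;

/// Hypothetical error classification; the real Ozone client error type
/// is not shown in this commit.
#[derive(Debug, PartialEq)]
pub enum ApiError {
    /// 5xx responses and timeouts: worth retrying.
    Transient(String),
    /// 4xx responses: retrying will not help.
    Permanent(String),
}

const INITIAL_BACKOFF: Duration = Duration::from_millis(100);
const MAX_BACKOFF: Duration = Duration::from_secs(5);
const MAX_RETRIES: u32 = 3;

/// Delay before retry number `retry` (0-based): 100ms, 200ms, 400ms, ...
/// capped at 5s.
pub fn backoff_delay(retry: u32) -> Duration {
    let factor = 2u32.saturating_pow(retry);
    INITIAL_BACKOFF.saturating_mul(factor).min(MAX_BACKOFF)
}

/// Runs `call`, retrying transient failures with exponential backoff.
/// `sleep` is injected so the waiting strategy can be swapped out
/// (for example `std::thread::sleep` in blocking code).
pub fn with_retries<T>(
    mut call: impl FnMut() -> Result<T, ApiError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, ApiError> {
    let mut retry = 0;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(ApiError::Transient(_)) if retry < MAX_RETRIES => {
                sleep(backoff_delay(retry));
                retry += 1;
            }
            // Permanent errors, or transient errors after the last retry.
            Err(err) => return Err(err),
        }
    }
}
```

Passing `std::thread::sleep` as the second argument gives real delays in blocking code; an async worker would need an equivalent built on its runtime's timer.
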
5. **Add comprehensive llms.txt documentation**
- Created 48 KB of documentation covering the entire codebase
- Covers architecture, design decisions, critical paths
- Documents all external integrations and error handling
- Includes debugging tips and deployment notes
## Configuration Changes
New environment variables:
- BLOB_DOWNLOAD_TIMEOUT_SECS (default: 10)
- BLOB_TOTAL_TIMEOUT_SECS (default: 30)
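
A possible way to read these variables with their defaults. The `BlobTimeouts` struct and `parse_secs` helper are hypothetical, and falling back to the default on a missing, unparsable, or zero value is an assumption about the intended behaviour:

```rust
use std::env;
use std::time::Duration;

/// Parses a whole number of seconds, falling back to `default_secs`
/// when the value is absent, not a number, or zero (assumed policy).
pub fn parse_secs(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|s| *s > 0)
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Hypothetical container for the two blob download timeouts.
#[derive(Debug, PartialEq)]
pub struct BlobTimeouts {
    pub per_attempt: Duration,
    pub total: Duration,
}

impl BlobTimeouts {
    /// Reads both settings from the environment, using the documented
    /// defaults of 10s per attempt and 30s in total.
    pub fn from_env() -> Self {
        let per_attempt = env::var("BLOB_DOWNLOAD_TIMEOUT_SECS").ok();
        let total = env::var("BLOB_TOTAL_TIMEOUT_SECS").ok();
        Self {
            per_attempt: parse_secs(per_attempt.as_deref(), 10),
            total: parse_secs(total.as_deref(), 30),
        }
    }
}
```

Keeping the parsing in a pure function separate from the environment lookup makes the fallback rules easy to check without touching process-wide state.
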
## What's Left
The following critical improvements remain for full production readiness:
- Circuit breaker pattern for cascading failure prevention
- Redis connection failure backoff and recovery
- Broader test coverage (currently below 5%)
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>