# connection pool exhaustion

## symptoms

- 500 errors across multiple endpoints
- 30-second request timeouts
- logfire shows: `QueuePool limit of size 10 overflow 5 reached, connection timed out`
- queue listener logs: `queue listener connection lost, attempting reconnect`
- database connection errors mentioning multiple Neon IP addresses timing out

## observed behavior (2025-12-08 incident)

evidence from logfire spans:

| time (UTC) | event | duration |
|------------|-------|----------|
| 06:32:40 | queue service connected | - |
| 06:32:50-06:33:29 | SQLAlchemy connects succeeding | 3-6ms |
| 06:33:36 | queue heartbeat times out | 5s timeout |
| 06:33:36-06:36:04 | ~2.5 min gap with no spans | - |
| 06:36:04 | GET /albums starts | hangs 24 min |
| 06:36:06 | GET /moderation starts | **succeeds in 14ms** |
| 06:36:06 | GET /auth/me starts | hangs 18 min |
| 06:36:31 | multiple requests | **succeed in 3-15ms** |

key observation: **some connections succeed in 3ms while others hang for 20+ minutes at the same time**. the stuck connections show psycopg retrying across 12 different Neon IP addresses.

## what we know

1. the queue listener heartbeat (`SELECT 1`) times out after 5 seconds
2. psycopg retries connection attempts across multiple IPs when one fails
3. each IP retry has its own timeout, so the worst-case total is timeout × number of IPs — with 12 candidate IPs, even a modest per-IP timeout multiplies into many minutes of hang time
4. some connections succeed immediately while others get stuck
5. restarting the fly machines clears the stuck connections

## what we don't know

- why some connections succeed while others fail at the same moment
- whether this is a Neon proxy issue, a DNS issue, or an application issue
- why psycopg doesn't give up after a reasonable total timeout

## remediation

restart the fly machines to clear stuck connections:

```bash
# list machines
fly machines list -a relay-api

# restart both machines
fly machines restart <machine-id-1> <machine-id-2> -a relay-api
```

## verification

check logfire for healthy spans after the restart:

```sql
SELECT
    span_name,
    message,
    start_timestamp,
    duration * 1000 as duration_ms,
    otel_status_code
FROM records
WHERE deployment_environment = 'production'
  AND start_timestamp > NOW() - INTERVAL '5 minutes'
ORDER BY start_timestamp DESC
LIMIT 30
```

you should see:

- `queue service connected to database and listening`
- database queries completing in <50ms
- no ERROR status codes

## incident history

- **2025-11-17**: first occurrence, queue listener hung indefinitely (fixed by adding a timeout)
- **2025-12-02**: cold start variant, 10 errors (fixed by increasing the pool size)
- **2025-12-08**: 37 errors in one hour, some connections stuck 20+ min while others worked

## future investigation

- consider adding a total connection timeout that caps retries across all IPs (see the total-timeout sketch in the appendix below)
- investigate whether disabling IPv6 reduces retry time
- add monitoring/alerting for queue listener disconnects
- consider a circuit breaker pattern to fail fast when connections are failing (see the circuit breaker sketch in the appendix below)

## related docs

- [connection pooling config](../backend/database/connection-pooling.md)
- [logfire querying guide](../tools/logfire.md)
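
## appendix: total connection timeout sketch

referenced from "future investigation" above. a minimal sketch, assuming psycopg 3's async API, of capping the total time spent connecting regardless of how many Neon IPs psycopg cycles through. `connect_with_cap` and `TOTAL_CONNECT_TIMEOUT` are hypothetical names, not existing relay-api code; because the per-IP timeout applies to each attempt separately (see "what we know"), an outer `asyncio.wait_for` is one way to enforce a total budget.

```python
import asyncio

import psycopg

# hypothetical total budget for one connection attempt, covering every
# IP that psycopg tries -- not a setting that exists in our code today
TOTAL_CONNECT_TIMEOUT = 10.0


async def connect_with_cap(conninfo: str) -> psycopg.AsyncConnection:
    """open a connection, but give up after TOTAL_CONNECT_TIMEOUT seconds
    no matter how many IP addresses are retried underneath."""
    try:
        return await asyncio.wait_for(
            psycopg.AsyncConnection.connect(conninfo),
            timeout=TOTAL_CONNECT_TIMEOUT,
        )
    except asyncio.TimeoutError:
        # fail fast instead of hanging for timeout x number of IPs
        raise ConnectionError(
            f"database connection not established within {TOTAL_CONNECT_TIMEOUT}s"
        ) from None
```

a request that hits this cap fails in ~10 seconds instead of hanging for 20+ minutes, so a single stuck attempt can't hold a pool slot hostage.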
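
## appendix: circuit breaker sketch

also referenced from "future investigation" above. a minimal sketch of the fail-fast idea: once a few connection attempts in a row have failed, stop sending new requests to the pool and return errors immediately until a cooldown passes. the class name and thresholds are illustrative, not tuned values from our stack.

```python
import time


class ConnectionCircuitBreaker:
    """track recent connection failures and open the circuit once they
    cross a threshold, so requests fail fast instead of waiting 30s for
    the pool timeout."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        # closed circuit: let the request try to check out a connection
        if self.opened_at is None:
            return True
        # open circuit: only allow a probe once the cooldown has elapsed
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

usage would wrap the pool checkout: if `allow_request()` returns False, respond with a 503 immediately rather than letting the request queue behind a stuck connection.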