# connection pool exhaustion

## symptoms

- 500 errors across multiple endpoints
- 30-second request timeouts
- logfire shows: `QueuePool limit of size 10 overflow 5 reached, connection timed out`
- queue listener logs: `queue listener connection lost, attempting reconnect`
- database connection errors mentioning multiple Neon IP addresses timing out

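the numbers in that error map directly onto SQLAlchemy pool settings. a minimal sketch of a pool configured the way the message implies (the DSN and driver here are placeholders, and the real engine options live in the connection pooling config linked below):

```python
# sketch only: pool settings consistent with the logged error, not the actual relay-api config
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+psycopg://user:pass@example.neon.tech/plyr",  # placeholder DSN
    pool_size=10,     # "QueuePool limit of size 10"
    max_overflow=5,   # "overflow 5"
    pool_timeout=30,  # waiters give up after 30s, which matches the 30-second request timeouts
)
```
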
## observed behavior (2025-12-08 incident)

evidence from logfire spans:

| time (UTC) | event | duration |
|------------|-------|----------|
| 06:32:40 | queue service connected | - |
| 06:32:50-06:33:29 | SQLAlchemy connects succeeding | 3-6ms |
| 06:33:36 | queue heartbeat times out | 5s timeout |
| 06:33:36-06:36:04 | ~2.5 min gap with no spans | - |
| 06:36:04 | GET /albums starts | hangs 24 min |
| 06:36:06 | GET /moderation starts | **succeeds in 14ms** |
| 06:36:06 | GET /auth/me starts | hangs 18 min |
| 06:36:31 | multiple requests | **succeed in 3-15ms** |

key observation: **some connections succeed in 3ms while, at the same time, others hang for 20+ minutes**. the stuck connections show psycopg retrying across 12 different Neon IP addresses.

## what we know

1. the queue listener heartbeat (`SELECT 1`) times out after 5 seconds
2. psycopg retries connection attempts across multiple IPs when one fails
3. each IP retry has its own timeout, so total time = timeout × number of IPs (see the sketch after this list)
4. some connections succeed immediately while others get stuck
5. restarting the fly machines clears the stuck connections

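to make point 3 concrete: the worst-case connect time is the per-attempt timeout multiplied by however many addresses the Neon hostname resolves to. a quick way to see that multiplier (the endpoint name and 5s per-attempt timeout below are placeholders, not the production values):

```python
# rough illustration of "timeout × number of IPs"; substitute the real database host
import socket

host = "ep-example-123456.us-east-2.aws.neon.tech"  # placeholder Neon endpoint
per_attempt_timeout = 5  # seconds, assumed per-IP connect timeout

# collect the distinct IPv4/IPv6 addresses psycopg would try in turn
addresses = {info[4][0] for info in socket.getaddrinfo(host, 5432, proto=socket.IPPROTO_TCP)}
worst_case = len(addresses) * per_attempt_timeout
print(f"{len(addresses)} candidate IPs -> up to {worst_case}s before a single connect attempt fails")
```
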
## what we don't know

- why some connections succeed while others fail simultaneously
- whether this is a Neon proxy issue, DNS issue, or application issue
- why psycopg doesn't give up after a reasonable total timeout

## remediation

restart the fly machines to clear stuck connections:

```bash
# list machines
fly machines list -a relay-api

# restart both machines
fly machines restart <machine-id-1> <machine-id-2> -a relay-api
```

## verification

check logfire for healthy spans after restart:

```sql
SELECT
  span_name,
  message,
  start_timestamp,
  duration * 1000 as duration_ms,
  otel_status_code
FROM records
WHERE deployment_environment = 'production'
  AND start_timestamp > NOW() - INTERVAL '5 minutes'
ORDER BY start_timestamp DESC
LIMIT 30
```

you should see:
- `queue service connected to database and listening`
- database queries completing in <50ms
- no ERROR status codes

## incident history

- **2025-11-17**: first occurrence, queue listener hung indefinitely (fixed by adding a timeout)
- **2025-12-02**: cold start variant, 10 errors (fixed by increasing pool size)
- **2025-12-08**: 37 errors in one hour, some connections stuck 20+ min while others worked

## future investigation

- consider adding a total connection timeout that caps retries across all IPs (see the sketch after this list)
- investigate whether disabling IPv6 reduces retry time
- add monitoring/alerting for queue listener disconnects
- consider a circuit breaker pattern to fail fast when connections are failing

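a rough sketch of what the first and last ideas could look like, kept deliberately generic (none of this exists in the codebase yet; names, thresholds, and timeouts are made up):

```python
# sketch: cap total connect time across all IP retries, and fail fast once connects keep failing
import asyncio
import time


class ConnectCircuitBreaker:
    """after `threshold` consecutive failures, reject immediately for `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    async def connect(self, connect_factory, total_timeout: float = 15.0):
        # fail fast while the breaker is open
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.cooldown:
            raise ConnectionError("circuit open: recent connection attempts are failing")
        try:
            # wait_for bounds the *total* time, no matter how many IPs are retried underneath
            conn = await asyncio.wait_for(connect_factory(), timeout=total_timeout)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None
        return conn
```
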
## related docs

- [connection pooling config](../backend/database/connection-pooling.md)
- [logfire querying guide](../tools/logfire.md)