# connection pool exhaustion

## symptoms
- 500 errors across multiple endpoints
- 30-second request timeouts
- logfire shows: `QueuePool limit of size 10 overflow 5 reached, connection timed out`
- queue listener logs: `queue listener connection lost, attempting reconnect`
- database connection errors mentioning multiple Neon IP addresses timing out
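to confirm what is actually failing, a query against logfire's `records` table along these lines can help; it reuses the schema from the verification query further down and is a sketch, not a saved dashboard:

```sql
-- recent errored spans in production
SELECT start_timestamp, span_name, message
FROM records
WHERE deployment_environment = 'production'
  AND otel_status_code = 'ERROR'
  AND start_timestamp > NOW() - INTERVAL '1 hour'
ORDER BY start_timestamp DESC
LIMIT 50
```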
## observed behavior (2025-12-08 incident)
evidence from logfire spans:
| time (UTC) | event | duration |
|---|---|---|
| 06:32:40 | queue service connected | - |
| 06:32:50-06:33:29 | SQLAlchemy connects succeeding | 3-6ms |
| 06:33:36 | queue heartbeat times out | 5s timeout |
| 06:33:36-06:36:04 | ~2.5 min gap with no spans | - |
| 06:36:04 | GET /albums starts | hangs 24 min |
| 06:36:06 | GET /moderation starts | succeeds in 14ms |
| 06:36:06 | GET /auth/me starts | hangs 18 min |
| 06:36:31 | multiple requests | succeed in 3-15ms |
key observation: at the same moment, some connections succeed in 3ms while others hang for 20+ minutes. the stuck connections show psycopg retrying across 12 different Neon IP addresses.
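stuck requests like these can also be pulled straight out of logfire by sorting on span duration; a sketch using the same `records` schema as the verification query below (the one-minute cutoff is arbitrary):

```sql
-- spans that ran far longer than a normal request
SELECT span_name, start_timestamp, duration * 1000 AS duration_ms
FROM records
WHERE deployment_environment = 'production'
  AND duration > 60
  AND start_timestamp > NOW() - INTERVAL '1 hour'
ORDER BY duration DESC
LIMIT 20
```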
## what we know
- the queue listener heartbeat (`SELECT 1`) times out after 5 seconds
- psycopg retries connection attempts across multiple IPs when one fails
- each IP retry has its own timeout, so total time = timeout × number of IPs (see the sketch after this list)
- some connections succeed immediately while others get stuck
- restarting the fly machines clears the stuck connections
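since the per-attempt timeout multiplies across every address the hostname resolves to, counting the addresses DNS actually returns gives a worst-case bound. a minimal sketch, assuming a placeholder Neon hostname and a 5-second per-attempt timeout (neither is taken from the app's real config):

```python
import socket

# placeholder hostname; substitute the real Neon endpoint from the app's DSN
NEON_HOST = "example-endpoint.neon.tech"
PER_ATTEMPT_TIMEOUT = 5  # seconds per IP attempt; assumption, not the configured value

# resolve every address psycopg could try (both A and AAAA records)
addrs = socket.getaddrinfo(NEON_HOST, 5432, proto=socket.IPPROTO_TCP)
ips = {info[4][0] for info in addrs}
v4 = [ip for ip in ips if ":" not in ip]
v6 = [ip for ip in ips if ":" in ip]

print(f"{len(ips)} addresses ({len(v4)} IPv4, {len(v6)} IPv6)")
print(
    f"worst-case connect time ≈ {PER_ATTEMPT_TIMEOUT * len(ips)}s "
    f"({PER_ATTEMPT_TIMEOUT}s per attempt × {len(ips)} addresses)"
)
```

if both IPv4 and IPv6 addresses come back, that also feeds the IPv6 question under future investigation.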
## what we don't know
- why some connections succeed while others fail simultaneously
- whether this is a Neon proxy issue, DNS issue, or application issue
- why psycopg doesn't give up after a reasonable total timeout
## remediation
restart the fly machines to clear stuck connections:
```bash
# list machines
fly machines list -a relay-api

# restart both machines
fly machines restart <machine-id-1> <machine-id-2> -a relay-api
```
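if the machine IDs aren't handy, the two steps can be chained; this assumes `fly machines list` supports `--json` output and that `jq` is installed, so double-check against your flyctl version:

```bash
# restart every machine in the app without copying IDs by hand
fly machines list -a relay-api --json | jq -r '.[].id' | xargs fly machines restart -a relay-api
```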
## verification
check logfire for healthy spans after restart:
```sql
SELECT
  span_name,
  message,
  start_timestamp,
  duration * 1000 as duration_ms,
  otel_status_code
FROM records
WHERE deployment_environment = 'production'
  AND start_timestamp > NOW() - INTERVAL '5 minutes'
ORDER BY start_timestamp DESC
LIMIT 30
```
you should see:
- `queue service connected to database and listening`
- database queries completing in <50ms
- no ERROR status codes
## incident history
- 2025-11-17: first occurrence, queue listener hung indefinitely (fixed by adding timeout)
- 2025-12-02: cold start variant, 10 errors (fixed by increasing pool size)
- 2025-12-08: 37 errors in one hour, some connections stuck 20+ min while others worked
## future investigation
- consider adding a total connection timeout that caps retries across all IPs (see the sketch after this list)
- investigate whether disabling IPv6 reduces retry time
- add monitoring/alerting for queue listener disconnects
- consider circuit breaker pattern to fail fast when connections are failing
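for the first item, one option is to keep a per-attempt `connect_timeout` but wrap connection acquisition in an overall deadline so a bad DNS or proxy day can't turn into a 20-minute hang. a minimal sketch with SQLAlchemy's async engine and the psycopg driver; the DSN, database name, and timeout values are placeholders, not the app's actual configuration (only the pool numbers are taken from the error message above):

```python
import asyncio

from sqlalchemy.ext.asyncio import create_async_engine

# placeholder DSN; the real Neon connection string comes from the app's config
engine = create_async_engine(
    "postgresql+psycopg://user:password@example-endpoint.neon.tech/plyr",
    pool_size=10,       # matches "QueuePool limit of size 10"
    max_overflow=5,     # matches "overflow 5"
    # passed through to psycopg/libpq; applies per host attempt, so it does
    # not by itself bound the total time across a dozen IPs
    connect_args={"connect_timeout": 5},
)


async def acquire_connection(total_timeout: float = 15.0):
    """Cap the total time spent acquiring a connection, across all IP retries."""
    try:
        return await asyncio.wait_for(engine.connect(), timeout=total_timeout)
    except asyncio.TimeoutError:
        # fail fast instead of hanging for minutes while psycopg walks the IP list
        raise RuntimeError(
            f"no database connection within {total_timeout}s; failing fast"
        ) from None
```

a circuit breaker (last bullet) would layer on top of this: after a few consecutive timeouts, return 503s immediately rather than queueing more requests behind an exhausted pool.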