# database connection pooling

configuration and best practices for managing database connections in production.

## overview

plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:

- preventing connection exhaustion under load
- failing fast when the database is slow or unresponsive
- optimizing resource usage across concurrent requests

## configuration

all settings are configurable via environment variables and defined in `src/backend/config.py`.

### timeouts

```bash
# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 10s)
# set higher than Neon cold start latency (~5-10s) to allow wake-up
DATABASE_CONNECTION_TIMEOUT=10.0

# how long to wait for an available connection from the pool (default: = connection_timeout)
# this is automatically set to match DATABASE_CONNECTION_TIMEOUT
```

**why these matter:**

- **statement_timeout**: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
- **connection_timeout**: fails fast when the database is slow or unreachable. set higher than Neon cold start latency (5-10s) to allow serverless databases to wake up after idle periods.
- **pool_timeout**: fails fast when all connections are busy. without it, requests wait forever when the pool is exhausted.

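a minimal sketch of how these settings resolve, assuming env-var parsing along the lines of `src/backend/config.py` (`read_timeouts` is a hypothetical helper, not the actual config code):

```python
def read_timeouts(env: dict[str, str]) -> dict[str, float]:
    """hypothetical sketch: resolve timeout settings from env vars."""
    statement = float(env.get("DATABASE_STATEMENT_TIMEOUT", "10.0"))
    connection = float(env.get("DATABASE_CONNECTION_TIMEOUT", "10.0"))
    # pool_timeout is automatically set to match the connection timeout
    pool = connection
    return {"statement": statement, "connection": connection, "pool": pool}

# with no overrides, everything defaults to 10s
print(read_timeouts({}))
```
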
### connection pool sizing

```bash
# number of persistent connections to maintain (default: 10)
DATABASE_POOL_SIZE=10

# additional connections to create on demand when pool is exhausted (default: 5)
DATABASE_MAX_OVERFLOW=5

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true
```

**sizing considerations:**

total max connections = `pool_size` + `max_overflow` = 15 by default

**pool_size:**
- too small: connection contention, requests wait for available connections
- too large: wastes memory and database resources
- the default of 10 handles Neon cold start scenarios where multiple requests arrive after idle periods

**max_overflow:**
- provides burst capacity for traffic spikes
- the default of 5 allows 15 total connections under peak load
- connections beyond pool_size are closed when returned (not kept idle)

**pool_recycle:**
- prevents stale connections from lingering
- should be less than your database's connection timeout
- 2 hours is a safe default for most PostgreSQL configurations

**pool_pre_ping:**
- adds a small overhead (SELECT 1) before each connection use
- prevents using connections that were closed by the database
- recommended in production to avoid connection errors

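the sizing math can be sanity-checked with a tiny sketch; the kwarg names (`pool_size`, `max_overflow`, `pool_recycle`, `pool_pre_ping`) are real SQLAlchemy `create_engine()` options, but this helper is illustrative, not the plyr.fm engine setup:

```python
def pool_kwargs(pool_size: int = 10, max_overflow: int = 5,
                pool_recycle: int = 7200, pool_pre_ping: bool = True) -> dict:
    """illustrative mapping from env settings to SQLAlchemy engine kwargs."""
    return {
        "pool_size": pool_size,          # persistent connections
        "max_overflow": max_overflow,    # on-demand burst connections
        "pool_recycle": pool_recycle,    # seconds before a connection is replaced
        "pool_pre_ping": pool_pre_ping,  # SELECT 1 health check on checkout
    }

kw = pool_kwargs()
# total max connections = pool_size + max_overflow
print(kw["pool_size"] + kw["max_overflow"])  # → 15
```
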
## Neon serverless considerations

plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces **cold start latency** that affects connection pooling:

### the cold start problem

1. site is idle for several minutes → Neon scales down
2. first request arrives → Neon needs 5-10s to wake up
3. if pool_size is too small, all connections hang waiting for Neon
4. new requests can't get connections → 500 errors

### how we mitigate this

**larger connection pool (pool_size=10, max_overflow=5):**
- allows 15 concurrent requests to wait for Neon wake-up
- prevents pool exhaustion during cold start

**appropriate connection timeout (10s):**
- long enough to wait for Neon cold start (~5-10s)
- short enough to fail fast on true database outages

**queue listener heartbeat:**
- background task pings the database every 5s
- detects connection death before user requests fail
- triggers reconnection with exponential backoff

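the heartbeat-with-backoff pattern can be sketched with stdlib asyncio; `ping` and `reconnect` stand in for the real queue-listener calls, and the 1s starting delay and 60s cap are assumptions:

```python
import asyncio

def next_backoff(current: float, cap: float = 60.0) -> float:
    """exponential backoff: double the delay after each failure, up to a cap."""
    return min(current * 2, cap)

async def heartbeat(ping, reconnect, stop: asyncio.Event, interval: float = 5.0):
    """ping every `interval` seconds; on failure, reconnect with backoff."""
    backoff = 1.0
    while not stop.is_set():
        try:
            await asyncio.wait_for(ping(), timeout=interval)
            backoff = 1.0  # connection healthy: reset the backoff
            await asyncio.sleep(interval)
        except Exception:
            await reconnect()  # re-establish the listener connection
            await asyncio.sleep(backoff)
            backoff = next_backoff(backoff)
```

a production version would also log each failure and alert once the backoff saturates.
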
### incident history

- **2025-11-17**: first pool exhaustion outage - the queue listener hung indefinitely on a slow database. fix: added a 15s timeout to asyncpg.connect() in the queue service.
- **2025-12-02**: cold start recurrence - a 5 minute idle period caused Neon to scale down. the first 5 requests after wake-up hung for 3-5 minutes each, exhausting the pool. fix: increased pool_size to 10, max_overflow to 5, and connection_timeout to 10s.

## production best practices

### current deployment (Neon serverless)

```bash
# pool sized for cold start scenarios
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=5

# timeout accounts for Neon wake-up latency
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=10.0

# standard recycle
DATABASE_POOL_RECYCLE=7200
```

this configuration:
- handles 15 concurrent requests during Neon cold start
- fails fast (10s) on true database issues
- balances resource usage with reliability

### if seeing pool exhaustion

**option 1: increase pool size**
```bash
DATABASE_POOL_SIZE=15
DATABASE_MAX_OVERFLOW=5
```
pros: more concurrent capacity during cold starts
cons: more database connections when warm

**option 2: increase overflow**
```bash
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=10  # allows 20 total under burst
```
pros: higher burst capacity, same baseline
cons: less predictable peak resource usage

### tuning statement timeout

adjust based on your query patterns:

```bash
# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0
```

analyze slow query logs to understand p99 query latency, then set the timeout with appropriate headroom.

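for example, a rough nearest-rank p99 calculation with a headroom multiplier (the 2x factor here is an assumption, not a plyr.fm policy):

```python
import math

def suggest_timeout(latencies_s: list[float], headroom: float = 2.0) -> float:
    """nearest-rank p99 of observed latencies, times a headroom factor."""
    ordered = sorted(latencies_s)
    p99_index = math.ceil(99 * len(ordered) / 100) - 1
    return round(ordered[p99_index] * headroom, 2)

# 98 fast queries plus two slow ones: nearest-rank p99 is 0.4s
print(suggest_timeout([0.05] * 98 + [0.4, 0.9]))  # → 0.8
```
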
## monitoring

### what to watch

1. **pool utilization**: `engine.pool.checkedout()` / `pool_size`
   - alert if > 80% for sustained periods
   - indicates a need to increase pool_size

2. **connection timeouts**: rate of pool timeout errors
   - indicates pool exhaustion
   - consider increasing pool_size or max_overflow

3. **statement timeouts**: rate of query timeout errors
   - indicates slow queries or an inappropriate timeout
   - investigate slow queries, consider adjusting the timeout

4. **connection errors**: failed connection attempts
   - database availability issues
   - network problems

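the utilization check in point 1 reduces to a one-line threshold test; `should_alert` is illustrative, though `checkedout()` and `size()` are real SQLAlchemy `QueuePool` methods you could feed into it:

```python
def should_alert(checked_out: int, pool_size: int, threshold: float = 0.8) -> bool:
    """true when pool utilization exceeds the alert threshold."""
    return checked_out / pool_size > threshold

# values would come from engine.pool.checkedout() and engine.pool.size()
print(should_alert(9, 10))  # → True (90% utilization; alert if sustained)
print(should_alert(5, 10))  # → False
```
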
### observability

the engine is instrumented with logfire when observability is enabled (`LOGFIRE_ENABLED=true`). this provides:

- connection pool metrics
- query execution times
- timeout events
- connection errors

review logfire traces to understand:
- which queries are slow
- when pool exhaustion occurs
- connection patterns under load

## how it works

### connection lifecycle

1. **request arrives** → acquire connection from pool
2. **execute query** → use connection (with statement_timeout)
3. **request complete** → return connection to pool
4. **recycle threshold** → close and replace old connections

### failure modes

**pool exhausted:**
- all `pool_size` + `max_overflow` connections in use
- new requests wait for `pool_timeout` seconds
- timeout → 503 error (fail fast)

**database slow:**
- connection attempt exceeds `connection_timeout`
- timeout → fail fast, can retry

**query too slow:**
- query exceeds `statement_timeout`
- query killed, connection returned to pool
- prevents one slow query from blocking others

**connection died:**
- `pool_pre_ping` detects dead connection
- connection discarded, new one created
- prevents errors from using stale connections

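these failure modes suggest different HTTP responses; the mapping below is illustrative rather than plyr.fm's actual error handler (`TimeoutError` is SQLAlchemy's pool-timeout exception, and the asyncpg names appear in the troubleshooting section):

```python
def status_for_failure(error_name: str) -> int:
    """illustrative mapping from failure mode to an HTTP status code."""
    mapping = {
        "TimeoutError": 503,            # pool exhausted: fail fast, retryable
        "ConnectionTimeoutError": 503,  # database slow or unreachable
        "QueryCanceledError": 504,      # statement_timeout killed the query
    }
    return mapping.get(error_name, 500)

print(status_for_failure("QueryCanceledError"))  # → 504
```
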
## troubleshooting

### 503 errors (pool exhausted)

**symptoms:**
- `QueuePool limit of size N overflow M reached`
- requests timing out waiting for connections

**diagnosis:**
```python
# check pool status
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")
print(f"pool size: {engine.pool.size()}")
```

**solutions:**
1. increase `DATABASE_POOL_SIZE`
2. add `DATABASE_MAX_OVERFLOW` for burst capacity
3. investigate slow queries holding connections
4. check for connection leaks (connections not being returned)

### connection timeouts

**symptoms:**
- `asyncpg.exceptions.ConnectionTimeoutError`
- requests failing to connect to the database

**diagnosis:**
- check database availability
- check network latency
- review logfire connection traces

**solutions:**
1. verify the database is responsive
2. increase `DATABASE_CONNECTION_TIMEOUT` if network latency is high
3. investigate database performance (CPU, I/O)

### statement timeouts

**symptoms:**
- `asyncpg.exceptions.QueryCanceledError`
- specific queries timing out

**diagnosis:**
- identify slow queries in logs
- check query execution plans
- review database indexes

**solutions:**
1. optimize slow queries
2. add database indexes
3. increase `DATABASE_STATEMENT_TIMEOUT` if queries are legitimately slow
4. consider background processing for long operations

## implementation details

### asyncpg-specific configuration

when using `postgresql+asyncpg://` URLs, additional connection settings are applied:

```python
connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",

    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,

    # statement timeout
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,

    # connection timeout
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}
```

these settings prevent prepared statement caching issues in asyncpg while maintaining performance.

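a self-contained sketch assembling that dict from the timeout settings (`build_connect_args` is a hypothetical helper, not the actual function in `src/backend/utilities/database.py`):

```python
import uuid

def build_connect_args(statement_timeout: float = 10.0,
                       connection_timeout: float = 10.0) -> dict:
    """hypothetical helper producing the asyncpg connect_args shown above."""
    return {
        "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",
        "statement_cache_size": 0,
        "prepared_statement_cache_size": 0,
        "command_timeout": statement_timeout,
        "timeout": connection_timeout,
    }

args = build_connect_args()
# each call to the name func yields a unique prepared statement name
print(args["prepared_statement_name_func"]().startswith("__asyncpg_"))  # → True
```
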
### LIFO pool strategy

connections are returned in LIFO (last-in, first-out) order:

```python
pool_use_lifo=True
```

benefits:
- recently used connections are more likely to still be valid
- helps shed excess connections after load spikes
- increases the likelihood of connection cache hits

## references

- implementation: `src/backend/utilities/database.py`
- settings: `src/backend/config.py`
- SQLAlchemy pooling: https://docs.sqlalchemy.org/en/20/core/pooling.html
- asyncpg connections: https://magicstack.github.io/asyncpg/current/api/index.html#connection