# database connection pooling

configuration and best practices for managing database connections in production.

## overview

plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:

- preventing connection exhaustion under load
- failing fast when the database is slow or unresponsive
- optimizing resource usage across concurrent requests

## configuration

all settings are configurable via environment variables and defined in `src/backend/config.py`.

### timeouts

```bash
# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 10s)
# set higher than Neon cold start latency (~5-10s) to allow wake-up
DATABASE_CONNECTION_TIMEOUT=10.0

# how long to wait for an available connection from the pool (default: = connection_timeout)
# this is automatically set to match DATABASE_CONNECTION_TIMEOUT
```

**why these matter:**

- **statement_timeout**: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
- **connection_timeout**: fails fast when the database is slow or unreachable. set higher than Neon cold start latency (5-10s) to allow serverless databases to wake up after idle periods.
- **pool_timeout**: fails fast when all connections are busy. without it, requests wait forever when the pool is exhausted.

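a minimal sketch of how these settings resolve, assuming env-var parsing along the lines of `src/backend/config.py` (`read_timeouts` is a hypothetical helper, not the actual config code):

```python
def read_timeouts(env: dict[str, str]) -> dict[str, float]:
    """hypothetical sketch: resolve timeout settings from env vars."""
    statement = float(env.get("DATABASE_STATEMENT_TIMEOUT", "10.0"))
    connection = float(env.get("DATABASE_CONNECTION_TIMEOUT", "10.0"))
    # pool_timeout is automatically set to match the connection timeout
    pool = connection
    return {"statement": statement, "connection": connection, "pool": pool}

# with no overrides, everything defaults to 10s
print(read_timeouts({}))
```
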
### connection pool sizing

```bash
# number of persistent connections to maintain (default: 10)
DATABASE_POOL_SIZE=10

# additional connections to create on demand when pool is exhausted (default: 5)
DATABASE_MAX_OVERFLOW=5

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true
```

**sizing considerations:**

total max connections = `pool_size` + `max_overflow` = 15 by default

**pool_size:**
- too small: connection contention, requests wait for available connections
- too large: wastes memory and database resources
- the default of 10 handles Neon cold start scenarios where multiple requests arrive after idle periods

**max_overflow:**
- provides burst capacity for traffic spikes
- the default of 5 allows 15 total connections under peak load
- connections beyond pool_size are closed when returned (not kept idle)

**pool_recycle:**
- prevents stale connections from lingering
- should be less than your database's connection timeout
- 2 hours is a safe default for most PostgreSQL configurations

**pool_pre_ping:**
- adds a small overhead (SELECT 1) before each connection use
- prevents using connections that were closed by the database
- recommended in production to avoid connection errors

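the sizing math can be sanity-checked with a tiny sketch; the kwarg names (`pool_size`, `max_overflow`, `pool_recycle`, `pool_pre_ping`) are real SQLAlchemy `create_engine()` options, but this helper is illustrative, not the plyr.fm engine setup:

```python
def pool_kwargs(pool_size: int = 10, max_overflow: int = 5,
                pool_recycle: int = 7200, pool_pre_ping: bool = True) -> dict:
    """illustrative mapping from env settings to SQLAlchemy engine kwargs."""
    return {
        "pool_size": pool_size,          # persistent connections
        "max_overflow": max_overflow,    # on-demand burst connections
        "pool_recycle": pool_recycle,    # seconds before a connection is replaced
        "pool_pre_ping": pool_pre_ping,  # SELECT 1 health check on checkout
    }

kw = pool_kwargs()
# total max connections = pool_size + max_overflow
print(kw["pool_size"] + kw["max_overflow"])  # → 15
```
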
## Neon serverless considerations

plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces **cold start latency** that affects connection pooling:

### the cold start problem

1. site is idle for several minutes → Neon scales down
2. first request arrives → Neon needs 5-10s to wake up
3. if pool_size is too small, all connections hang waiting for Neon
4. new requests can't get connections → 500 errors

### how we mitigate this

**larger connection pool (pool_size=10, max_overflow=5):**
- allows 15 concurrent requests to wait for Neon wake-up
- prevents pool exhaustion during cold start

**appropriate connection timeout (10s):**
- long enough to wait for Neon cold start (~5-10s)
- short enough to fail fast on true database outages

**queue listener heartbeat:**
- background task pings the database every 5s
- detects connection death before user requests fail
- triggers reconnection with exponential backoff

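the heartbeat-with-backoff pattern can be sketched with stdlib asyncio; `ping` and `reconnect` stand in for the real queue-listener calls, and the 1s starting delay and 60s cap are assumptions:

```python
import asyncio

def next_backoff(current: float, cap: float = 60.0) -> float:
    """exponential backoff: double the delay after each failure, up to a cap."""
    return min(current * 2, cap)

async def heartbeat(ping, reconnect, stop: asyncio.Event, interval: float = 5.0):
    """ping every `interval` seconds; on failure, reconnect with backoff."""
    backoff = 1.0
    while not stop.is_set():
        try:
            await asyncio.wait_for(ping(), timeout=interval)
            backoff = 1.0  # connection healthy: reset the backoff
            await asyncio.sleep(interval)
        except Exception:
            await reconnect()  # re-establish the listener connection
            await asyncio.sleep(backoff)
            backoff = next_backoff(backoff)
```

a production version would also log each failure and alert once the backoff saturates.
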
### incident history

- **2025-11-17**: first pool exhaustion outage - the queue listener hung indefinitely on a slow database. fix: added a 15s timeout to asyncpg.connect() in the queue service.
- **2025-12-02**: cold start recurrence - a 5 minute idle period caused Neon to scale down. the first 5 requests after wake-up hung for 3-5 minutes each, exhausting the pool. fix: increased pool_size to 10, max_overflow to 5, and connection_timeout to 10s.

## production best practices

### current deployment (Neon serverless)

```bash
# pool sized for cold start scenarios
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=5

# timeout accounts for Neon wake-up latency
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=10.0

# standard recycle
DATABASE_POOL_RECYCLE=7200
```

this configuration:
- handles 15 concurrent requests during Neon cold start
- fails fast (10s) on true database issues
- balances resource usage with reliability

### if seeing pool exhaustion

**option 1: increase pool size**
```bash
DATABASE_POOL_SIZE=15
DATABASE_MAX_OVERFLOW=5
```
pros: more concurrent capacity during cold starts
cons: more database connections when warm

**option 2: increase overflow**
```bash
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=10  # allows 20 total under burst
```
pros: higher burst capacity, same baseline
cons: less predictable peak resource usage

### tuning statement timeout

adjust based on your query patterns:

```bash
# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0
```

analyze slow query logs to understand p99 query latency, then set the timeout with appropriate headroom.

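for example, a rough nearest-rank p99 calculation with a headroom multiplier (the 2x factor here is an assumption, not a plyr.fm policy):

```python
import math

def suggest_timeout(latencies_s: list[float], headroom: float = 2.0) -> float:
    """nearest-rank p99 of observed latencies, times a headroom factor."""
    ordered = sorted(latencies_s)
    p99_index = math.ceil(99 * len(ordered) / 100) - 1
    return round(ordered[p99_index] * headroom, 2)

# 98 fast queries plus two slow ones: nearest-rank p99 is 0.4s
print(suggest_timeout([0.05] * 98 + [0.4, 0.9]))  # → 0.8
```
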
## monitoring

### what to watch

1. **pool utilization**: `engine.pool.checkedout()` / `pool_size`
   - alert if > 80% for sustained periods
   - indicates a need to increase pool_size

2. **connection timeouts**: rate of pool timeout errors
   - indicates pool exhaustion
   - consider increasing pool_size or max_overflow

3. **statement timeouts**: rate of query timeout errors
   - indicates slow queries or an inappropriate timeout
   - investigate slow queries, consider adjusting the timeout

4. **connection errors**: failed connection attempts
   - database availability issues
   - network problems

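the utilization check in point 1 reduces to a one-line threshold test; `should_alert` is illustrative, though `checkedout()` and `size()` are real SQLAlchemy `QueuePool` methods you could feed into it:

```python
def should_alert(checked_out: int, pool_size: int, threshold: float = 0.8) -> bool:
    """true when pool utilization exceeds the alert threshold."""
    return checked_out / pool_size > threshold

# values would come from engine.pool.checkedout() and engine.pool.size()
print(should_alert(9, 10))  # → True (90% utilization; alert if sustained)
print(should_alert(5, 10))  # → False
```
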
### observability

the engine is instrumented with logfire when observability is enabled (`LOGFIRE_ENABLED=true`). this provides:

- connection pool metrics
- query execution times
- timeout events
- connection errors

review logfire traces to understand:
- which queries are slow
- when pool exhaustion occurs
- connection patterns under load

## how it works

### connection lifecycle

1. **request arrives** → acquire connection from pool
2. **execute query** → use connection (with statement_timeout)
3. **request complete** → return connection to pool
4. **recycle threshold** → close and replace old connections

### failure modes

**pool exhausted:**
- all `pool_size` + `max_overflow` connections in use
- new requests wait for `pool_timeout` seconds
- timeout → 503 error (fail fast)

**database slow:**
- connection attempt exceeds `connection_timeout`
- timeout → fail fast, can retry

**query too slow:**
- query exceeds `statement_timeout`
- query killed, connection returned to pool
- prevents one slow query from blocking others

**connection died:**
- `pool_pre_ping` detects dead connection
- connection discarded, new one created
- prevents errors from using stale connections

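these failure modes suggest different HTTP responses; the mapping below is illustrative rather than plyr.fm's actual error handler (`TimeoutError` is SQLAlchemy's pool-timeout exception, and the asyncpg names appear in the troubleshooting section):

```python
def status_for_failure(error_name: str) -> int:
    """illustrative mapping from failure mode to an HTTP status code."""
    mapping = {
        "TimeoutError": 503,            # pool exhausted: fail fast, retryable
        "ConnectionTimeoutError": 503,  # database slow or unreachable
        "QueryCanceledError": 504,      # statement_timeout killed the query
    }
    return mapping.get(error_name, 500)

print(status_for_failure("QueryCanceledError"))  # → 504
```
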
## troubleshooting

### 503 errors (pool exhausted)

**symptoms:**
- `QueuePool limit of size N overflow M reached`
- requests timing out waiting for connections

**diagnosis:**
```python
# check pool status
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")
print(f"pool size: {engine.pool.size()}")
```

**solutions:**
1. increase `DATABASE_POOL_SIZE`
2. add `DATABASE_MAX_OVERFLOW` for burst capacity
3. investigate slow queries holding connections
4. check for connection leaks (connections not being returned)

### connection timeouts

**symptoms:**
- `asyncpg.exceptions.ConnectionTimeoutError`
- requests failing to connect to the database

**diagnosis:**
- check database availability
- check network latency
- review logfire connection traces

**solutions:**
1. verify the database is responsive
2. increase `DATABASE_CONNECTION_TIMEOUT` if network latency is high
3. investigate database performance (CPU, I/O)

### statement timeouts

**symptoms:**
- `asyncpg.exceptions.QueryCanceledError`
- specific queries timing out

**diagnosis:**
- identify slow queries in logs
- check query execution plans
- review database indexes

**solutions:**
1. optimize slow queries
2. add database indexes
3. increase `DATABASE_STATEMENT_TIMEOUT` if queries are legitimately slow
4. consider background processing for long operations

## implementation details

### asyncpg-specific configuration

when using `postgresql+asyncpg://` URLs, additional connection settings are applied:

```python
connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",

    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,

    # statement timeout
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,

    # connection timeout
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}
```

these settings prevent prepared statement caching issues in asyncpg while maintaining performance.

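a self-contained sketch assembling that dict from the timeout settings (`build_connect_args` is a hypothetical helper, not the actual function in `src/backend/utilities/database.py`):

```python
import uuid

def build_connect_args(statement_timeout: float = 10.0,
                       connection_timeout: float = 10.0) -> dict:
    """hypothetical helper producing the asyncpg connect_args shown above."""
    return {
        "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",
        "statement_cache_size": 0,
        "prepared_statement_cache_size": 0,
        "command_timeout": statement_timeout,
        "timeout": connection_timeout,
    }

args = build_connect_args()
# each call to the name func yields a unique prepared statement name
print(args["prepared_statement_name_func"]().startswith("__asyncpg_"))  # → True
```
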
### LIFO pool strategy

connections are returned in LIFO (last-in, first-out) order:

```python
pool_use_lifo=True
```

benefits:
- recently used connections are more likely to still be valid
- helps shed excess connections after load spikes
- increases the likelihood of connection cache hits

## references

- implementation: `src/backend/utilities/database.py`
- settings: `src/backend/config.py`
- SQLAlchemy pooling: https://docs.sqlalchemy.org/en/20/core/pooling.html
- asyncpg connections: https://magicstack.github.io/asyncpg/current/api/index.html#connection