# database connection pooling

configuration and best practices for managing database connections in production.

## overview

plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:

- preventing connection exhaustion under load
- failing fast when the database is slow or unresponsive
- optimizing resource usage across concurrent requests

## configuration

all settings are configurable via environment variables and defined in `src/backend/config.py`.

### timeouts

```bash
# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 10s)
# set higher than Neon cold start latency (~5-10s) to allow wake-up
DATABASE_CONNECTION_TIMEOUT=10.0

# how long to wait for an available connection from the pool (default: same as connection_timeout)
# this is automatically set to match DATABASE_CONNECTION_TIMEOUT
```

**why these matter:**

- **statement_timeout**: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
- **connection_timeout**: fails fast when the database is slow or unreachable. set higher than Neon cold start latency (5-10s) to allow serverless databases to wake up after idle periods.
- **pool_timeout**: fails fast when all connections are busy. without this, requests wait forever when the pool is exhausted.

### connection pool sizing

```bash
# number of persistent connections to maintain (default: 10)
DATABASE_POOL_SIZE=10

# additional connections to create on demand when pool is exhausted (default: 5)
DATABASE_MAX_OVERFLOW=5

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true
```

**sizing considerations:**

total max connections = `pool_size` + `max_overflow` = 15 by default

**pool_size:**
- too small: connection contention, requests wait for available connections
- too large: wastes memory and database resources
- default of 10 handles Neon cold start scenarios where multiple requests arrive after idle periods

**max_overflow:**
- provides burst capacity for traffic spikes
- default of 5 allows 15 total connections under peak load
- connections beyond pool_size are closed when returned (not kept idle)

**pool_recycle:**
- prevents stale connections from lingering
- should be shorter than any server- or proxy-side idle timeout that closes connections
- 2 hours is a safe default for most PostgreSQL configurations

**pool_pre_ping:**
- adds small overhead (SELECT 1) before each connection use
- prevents using connections that were closed by the database
- recommended for production to avoid connection errors
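### example engine setup

the sketch below shows how these variables might map onto SQLAlchemy's `create_async_engine` call. this is a minimal sketch, not the actual implementation (see `src/backend/utilities/database.py`); the `build_engine` helper name is illustrative.

```python
import os

from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine


def build_engine(database_url: str) -> AsyncEngine:
    """create an async engine for a postgresql+asyncpg:// URL from the settings above."""
    connection_timeout = float(os.getenv("DATABASE_CONNECTION_TIMEOUT", "10.0"))
    return create_async_engine(
        database_url,
        pool_size=int(os.getenv("DATABASE_POOL_SIZE", "10")),
        max_overflow=int(os.getenv("DATABASE_MAX_OVERFLOW", "5")),
        pool_recycle=int(os.getenv("DATABASE_POOL_RECYCLE", "7200")),
        pool_pre_ping=os.getenv("DATABASE_POOL_PRE_PING", "true").lower() == "true",
        # pool_timeout is derived from the connection timeout rather than its own variable
        pool_timeout=connection_timeout,
        connect_args={
            # asyncpg-level timeouts; the full connect_args are shown under implementation details
            "command_timeout": float(os.getenv("DATABASE_STATEMENT_TIMEOUT", "10.0")),
            "timeout": connection_timeout,
        },
    )
```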
## Neon serverless considerations

plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces **cold start latency** that affects connection pooling:

### the cold start problem

1. site is idle for several minutes → Neon scales down
2. first request arrives → Neon needs 5-10s to wake up
3. if pool_size is too small, all connections hang waiting for Neon
4. new requests can't get connections → 500 errors

### how we mitigate this

**larger connection pool (pool_size=10, max_overflow=5):**
- allows 15 concurrent requests to wait for Neon wake-up
- prevents pool exhaustion during cold start

**appropriate connection timeout (10s):**
- long enough to wait for Neon cold start (~5-10s)
- short enough to fail fast on true database outages

**queue listener heartbeat:**
- background task pings database every 5s
- detects connection death before user requests fail
- triggers reconnection with exponential backoff

### incident history

- **2025-11-17**: first pool exhaustion outage - queue listener hung indefinitely on slow database. fix: added 15s timeout to asyncpg.connect() in queue service.
- **2025-12-02**: cold start recurrence - 5 minute idle period caused Neon to scale down. first 5 requests after wake-up hung for 3-5 minutes each, exhausting pool. fix: increased pool_size to 10, max_overflow to 5, connection_timeout to 10s.

## production best practices

### current deployment (Neon serverless)

```bash
# pool sized for cold start scenarios
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=5

# timeout accounts for Neon wake-up latency
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=10.0

# standard recycle
DATABASE_POOL_RECYCLE=7200
```

this configuration:
- handles 15 concurrent requests during Neon cold start
- fails fast (10s) on true database issues
- balances resource usage with reliability

### if seeing pool exhaustion

**option 1: increase pool size**
```bash
DATABASE_POOL_SIZE=15
DATABASE_MAX_OVERFLOW=5
```
pros: more concurrent capacity during cold starts
cons: more database connections when warm

**option 2: increase overflow**
```bash
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=10  # allows 20 total under burst
```
pros: higher burst capacity, same baseline
cons: less predictable peak resource usage

### tuning statement timeout

adjust based on your query patterns:

```bash
# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0
```

analyze slow query logs to understand p99 query latency, then set timeout with appropriate headroom.

## monitoring

### what to watch

1. **pool utilization**: `engine.pool.checkedout()` / `pool_size`
   - alert if > 80% for sustained periods
   - indicates need to increase pool_size

2. **connection timeouts**: rate of pool timeout errors
   - indicates pool exhaustion
   - consider increasing pool_size or max_overflow

3. **statement timeouts**: rate of query timeout errors
   - indicates slow queries or inappropriate timeout
   - investigate slow queries, consider adjusting timeout

4. **connection errors**: failed connection attempts
   - database availability issues
   - network problems

### observability

the engine is instrumented with logfire when observability is enabled (`LOGFIRE_ENABLED=true`). this provides:

- connection pool metrics
- query execution times
- timeout events
- connection errors

review logfire traces to understand:
- which queries are slow
- when pool exhaustion occurs
- connection patterns under load
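### checking pool utilization

alongside logfire, pool utilization can also be read directly off the engine's pool, e.g. from a health endpoint or a periodic task. a minimal sketch — `log_pool_utilization` is an illustrative helper name, and the 80% threshold mirrors the alerting guidance above:

```python
import logging

from sqlalchemy.ext.asyncio import AsyncEngine

logger = logging.getLogger(__name__)


def log_pool_utilization(engine: AsyncEngine) -> float:
    """log checked-out connections relative to pool_size and warn past 80%."""
    pool = engine.pool
    utilization = pool.checkedout() / pool.size()
    if utilization > 0.8:
        # pool.status() includes size, checked-in, overflow, and checked-out counts
        logger.warning("connection pool utilization high: %s", pool.status())
    return utilization
```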
## how it works

### connection lifecycle

1. **request arrives** → acquire connection from pool
2. **execute query** → use connection (with statement_timeout)
3. **request complete** → return connection to pool
4. **recycle threshold** → close and replace old connections

### failure modes

**pool exhausted:**
- all `pool_size` + `max_overflow` connections in use
- new requests wait for `pool_timeout` seconds
- timeout → 503 error (fail fast)

**database slow:**
- connection attempt exceeds `connection_timeout`
- timeout → fail fast, can retry

**query too slow:**
- query exceeds `statement_timeout`
- query killed, connection returned to pool
- prevents one slow query from blocking others

**connection died:**
- `pool_pre_ping` detects dead connection
- connection discarded, new one created
- prevents errors from using stale connections

## troubleshooting

### 503 errors (pool exhausted)

**symptoms:**
- `QueuePool limit of size N overflow M reached`
- requests timing out waiting for connections

**diagnosis:**
```python
# check pool status
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")
print(f"pool size: {engine.pool.size()}")
```

**solutions:**
1. increase `DATABASE_POOL_SIZE`
2. add `DATABASE_MAX_OVERFLOW` for burst capacity
3. investigate slow queries holding connections
4. check for connection leaks (connections not being returned)

### connection timeouts

**symptoms:**
- `asyncpg.exceptions.ConnectionTimeoutError`
- requests failing to connect to database

**diagnosis:**
- check database availability
- check network latency
- review logfire connection traces

**solutions:**
1. verify database is responsive
2. increase `DATABASE_CONNECTION_TIMEOUT` if network latency is high
3. investigate database performance (CPU, I/O)

### statement timeouts

**symptoms:**
- `asyncpg.exceptions.QueryCanceledError`
- specific queries timing out

**diagnosis:**
- identify slow queries in logs
- check query execution plans
- review database indexes

**solutions:**
1. optimize slow queries
2. add database indexes
3. increase `DATABASE_STATEMENT_TIMEOUT` if queries are legitimately slow
4. consider background processing for long operations
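### inspecting active queries

for the slow-query cases above (queries holding connections, statement timeouts), it often helps to see what the database is actually running. a minimal sketch — the query against `pg_stat_activity` is standard PostgreSQL, and `show_long_running_queries` is an illustrative helper, not part of the codebase:

```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncEngine

# backends that have been running a statement for more than 5 seconds
LONG_RUNNING_SQL = text(
    """
    SELECT pid,
           now() - query_start AS duration,
           state,
           left(query, 120) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND query_start < now() - interval '5 seconds'
    ORDER BY duration DESC
    """
)


async def show_long_running_queries(engine: AsyncEngine) -> None:
    """print long-running backend queries (call from an async context)."""
    async with engine.connect() as conn:
        result = await conn.execute(LONG_RUNNING_SQL)
        for row in result:
            print(f"pid={row.pid} duration={row.duration} state={row.state} query={row.query}")
```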
## implementation details

### asyncpg-specific configuration

when using `postgresql+asyncpg://` URLs, additional connection settings are applied:

```python
connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",

    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,

    # statement timeout
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,

    # connection timeout
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}
```

these settings prevent prepared statement caching issues in asyncpg while maintaining performance.

### LIFO pool strategy

connections are returned in LIFO (last-in, first-out) order:

```python
pool_use_lifo=True
```

benefits:
- recently used connections are more likely to still be valid
- helps shed excess connections after load spikes
- increases likelihood of connection cache hits

## references

- implementation: `src/backend/utilities/database.py`
- settings: `src/backend/config.py`
- sqlalchemy pooling: https://docs.sqlalchemy.org/en/20/core/pooling.html
- asyncpg connections: https://magicstack.github.io/asyncpg/current/api/index.html#connection