# database connection pooling

configuration and best practices for managing database connections in production.

## overview

plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:

- preventing connection exhaustion under load
- failing fast when the database is slow or unresponsive
- optimizing resource usage across concurrent requests

## configuration

all settings are configurable via environment variables and defined in `src/backend/config.py`.

### timeouts

```bash
# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 10s)
# set higher than Neon cold start latency (~5-10s) to allow wake-up
DATABASE_CONNECTION_TIMEOUT=10.0

# how long to wait for an available connection from the pool (default: = connection_timeout)
# this is automatically set to match DATABASE_CONNECTION_TIMEOUT
```

**why these matter:**

- **statement_timeout**: prevents runaway queries from holding connections indefinitely. set it based on your slowest expected query.
- **connection_timeout**: fails fast when the database is slow or unreachable. set it higher than Neon cold start latency (5-10s) so a serverless database has time to wake up after idle periods.
- **pool_timeout**: fails fast when all connections are busy. without this, requests wait forever when the pool is exhausted.

### connection pool sizing

```bash
# number of persistent connections to maintain (default: 10)
DATABASE_POOL_SIZE=10

# additional connections to create on demand when pool is exhausted (default: 5)
DATABASE_MAX_OVERFLOW=5

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true
```

**sizing considerations:**

total max connections = `pool_size` + `max_overflow` = 15 by default

**pool_size:**
- too small: connection contention, requests wait for available connections
- too large: wastes memory and database resources
- the default of 10 handles Neon cold start scenarios where multiple requests arrive after idle periods

**max_overflow:**
- provides burst capacity for traffic spikes
- the default of 5 allows 15 total connections under peak load
- connections beyond pool_size are closed when returned (not kept idle)

**pool_recycle:**
- prevents stale connections from lingering
- should be shorter than any idle-connection timeout enforced by the database or proxies in between
- 2 hours is a safe default for most PostgreSQL configurations

**pool_pre_ping:**
- adds a small overhead (SELECT 1) before each connection use
- prevents using connections that were closed by the database
- recommended for production to avoid connection errors
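to make the mapping concrete, here is a minimal sketch of how these environment variables could be wired into SQLAlchemy's `create_async_engine()`. the `settings` object and its field names are illustrative assumptions; the actual wiring lives in `src/backend/utilities/database.py` and may differ in detail.

```python
# minimal sketch, assuming a settings object whose fields mirror the env vars above;
# the real engine setup lives in src/backend/utilities/database.py
from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine


def build_engine(settings) -> AsyncEngine:
    return create_async_engine(
        settings.database_url,  # postgresql+asyncpg://...
        pool_size=settings.database_pool_size,              # persistent connections (10)
        max_overflow=settings.database_max_overflow,        # burst connections (5)
        pool_timeout=settings.database_connection_timeout,  # wait for a free pooled connection
        pool_recycle=settings.database_pool_recycle,        # recycle after 2 hours
        pool_pre_ping=settings.database_pool_pre_ping,      # health check before reuse
        pool_use_lifo=True,                                  # see "LIFO pool strategy" below
        connect_args={
            "timeout": settings.database_connection_timeout,         # asyncpg connect timeout
            "command_timeout": settings.database_statement_timeout,  # per-statement timeout
        },
    )
```

note that `pool_timeout` is derived from the same value as the connection timeout, matching the default described above.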
## Neon serverless considerations

plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces **cold start latency** that affects connection pooling:

### the cold start problem

1. site is idle for several minutes → Neon scales down
2. first request arrives → Neon needs 5-10s to wake up
3. if pool_size is too small, all connections hang waiting for Neon
4. new requests can't get connections → 500 errors

### how we mitigate this

**larger connection pool (pool_size=10, max_overflow=5):**
- allows 15 concurrent requests to wait for Neon wake-up
- prevents pool exhaustion during cold start

**appropriate connection timeout (10s):**
- long enough to wait for Neon cold start (~5-10s)
- short enough to fail fast on true database outages

**queue listener heartbeat:**
- background task pings the database every 5s
- detects connection death before user requests fail
- triggers reconnection with exponential backoff

### incident history

- **2025-11-17**: first pool exhaustion outage - the queue listener hung indefinitely on a slow database. fix: added a 15s timeout to asyncpg.connect() in the queue service.
- **2025-12-02**: cold start recurrence - a 5 minute idle period caused Neon to scale down. the first 5 requests after wake-up hung for 3-5 minutes each, exhausting the pool. fix: increased pool_size to 10, max_overflow to 5, connection_timeout to 10s.

## production best practices

### current deployment (Neon serverless)

```bash
# pool sized for cold start scenarios
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=5

# timeout accounts for Neon wake-up latency
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=10.0

# standard recycle
DATABASE_POOL_RECYCLE=7200
```

this configuration:

- handles 15 concurrent requests during Neon cold start
- fails fast (10s) on true database issues
- balances resource usage with reliability

### if you're seeing pool exhaustion

**option 1: increase pool size**

```bash
DATABASE_POOL_SIZE=15
DATABASE_MAX_OVERFLOW=5
```

- pros: more concurrent capacity during cold starts
- cons: more database connections when warm

**option 2: increase overflow**

```bash
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=10  # allows 20 total under burst
```

- pros: higher burst capacity, same baseline
- cons: less predictable peak resource usage

### tuning statement timeout

adjust based on your query patterns:

```bash
# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0
```

analyze slow query logs to understand p99 query latency, then set the timeout with appropriate headroom.

## monitoring

### what to watch

1. **pool utilization**: `engine.pool.checkedout()` / `pool_size`
   - alert if > 80% for sustained periods
   - indicates a need to increase pool_size
2. **connection timeouts**: rate of pool timeout errors
   - indicates pool exhaustion
   - consider increasing pool_size or max_overflow
3. **statement timeouts**: rate of query timeout errors
   - indicates slow queries or an inappropriate timeout
   - investigate slow queries, consider adjusting the timeout
4. **connection errors**: failed connection attempts
   - database availability issues
   - network problems

### observability

the engine is instrumented with logfire when observability is enabled (`LOGFIRE_ENABLED=true`). this provides:

- connection pool metrics
- query execution times
- timeout events
- connection errors

review logfire traces to understand:

- which queries are slow
- when pool exhaustion occurs
- connection patterns under load

## how it works

### connection lifecycle

1. **request arrives** → acquire connection from pool
2. **execute query** → use connection (with statement_timeout)
3. **request complete** → return connection to pool
4. **recycle threshold** → close and replace old connections
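the lifecycle above is what a request handler sees. below is a hedged sketch using an async session bound to the pooled engine; `engine` refers to the engine from the configuration sketch earlier, and `async_session_factory` is an illustrative name rather than the project's actual helper.

```python
# minimal sketch, assuming `engine` is the pooled async engine built earlier;
# async_session_factory is an illustrative name, not the project's actual helper
from sqlalchemy import text
from sqlalchemy.ext.asyncio import async_sessionmaker

async_session_factory = async_sessionmaker(engine, expire_on_commit=False)


async def handle_request() -> int:
    # 1. request arrives → a connection is checked out from the pool when the first
    #    statement runs; if all connections are busy, this waits up to pool_timeout
    async with async_session_factory() as session:
        # 2. execute query → runs under asyncpg's command_timeout (statement timeout)
        result = await session.execute(text("SELECT 1"))
        value = result.scalar_one()
    # 3. request complete → leaving the `async with` block returns the connection to the pool
    return value
```

if the pool is exhausted, the checkout raises SQLAlchemy's `TimeoutError` after `pool_timeout`, which the API layer can translate into the 503 described under failure modes below.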
### failure modes

**pool exhausted:**
- all `pool_size` + `max_overflow` connections are in use
- new requests wait up to `pool_timeout` seconds
- timeout → 503 error (fail fast)

**database slow:**
- connection attempt exceeds `connection_timeout`
- timeout → fail fast, can retry

**query too slow:**
- query exceeds `statement_timeout`
- query is killed, connection returned to pool
- prevents one slow query from blocking others

**connection died:**
- `pool_pre_ping` detects the dead connection
- connection discarded, a new one created
- prevents errors from using stale connections

## troubleshooting

### 503 errors (pool exhausted)

**symptoms:**
- `QueuePool limit of size N overflow M reached`
- requests timing out waiting for connections

**diagnosis:**

```python
# check pool status
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")
print(f"pool size: {engine.pool.size()}")
```

**solutions:**
1. increase `DATABASE_POOL_SIZE`
2. add `DATABASE_MAX_OVERFLOW` for burst capacity
3. investigate slow queries holding connections
4. check for connection leaks (connections not being returned)

### connection timeouts

**symptoms:**
- `asyncpg.exceptions.ConnectionTimeoutError`
- requests failing to connect to the database

**diagnosis:**
- check database availability
- check network latency
- review logfire connection traces

**solutions:**
1. verify the database is responsive
2. increase `DATABASE_CONNECTION_TIMEOUT` if network latency is high
3. investigate database performance (CPU, I/O)

### statement timeouts

**symptoms:**
- `asyncpg.exceptions.QueryCanceledError`
- specific queries timing out

**diagnosis:**
- identify slow queries in logs
- check query execution plans
- review database indexes

**solutions:**
1. optimize slow queries
2. add database indexes
3. increase `DATABASE_STATEMENT_TIMEOUT` if queries are legitimately slow
4. consider background processing for long operations

## implementation details

### asyncpg-specific configuration

when using `postgresql+asyncpg://` URLs, additional connection settings are applied:

```python
connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",
    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,
    # statement timeout
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,
    # connection timeout
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}
```

these settings prevent prepared statement caching issues in asyncpg while maintaining performance.

### LIFO pool strategy

connections are returned in LIFO (last-in, first-out) order:

```python
pool_use_lifo=True
```

benefits:

- recently used connections are more likely to still be valid
- helps shed excess connections after load spikes
- increases the likelihood of connection cache hits

## references

- implementation: `src/backend/utilities/database.py`
- settings: `src/backend/config.py`
- sqlalchemy pooling: https://docs.sqlalchemy.org/en/20/core/pooling.html
- asyncpg connections: https://magicstack.github.io/asyncpg/current/api/index.html#connection