# database connection pooling

configuration and best practices for managing database connections in production.

## overview
plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:
- preventing connection exhaustion under load
- failing fast when the database is slow or unresponsive
- optimizing resource usage across concurrent requests
## configuration

all settings are configurable via environment variables and defined in `src/backend/config.py`.

### timeouts
```bash
# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 10s)
# set higher than Neon cold start latency (~5-10s) to allow wake-up
DATABASE_CONNECTION_TIMEOUT=10.0

# how long to wait for an available connection from the pool (default: = connection_timeout)
# this is automatically set to match DATABASE_CONNECTION_TIMEOUT
```
**why these matter:**

- `statement_timeout`: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
- `connection_timeout`: fails fast when the database is slow or unreachable. set higher than Neon cold start latency (5-10s) to allow serverless databases to wake up after idle periods.
- `pool_timeout`: fails fast when all connections are busy. without this, requests wait forever when the pool is exhausted.
### connection pool sizing

```bash
# number of persistent connections to maintain (default: 10)
DATABASE_POOL_SIZE=10

# additional connections to create on demand when pool is exhausted (default: 5)
DATABASE_MAX_OVERFLOW=5

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true
```
**sizing considerations:**

total max connections = `pool_size` + `max_overflow` = 15 by default

`pool_size`:
- too small: connection contention, requests wait for available connections
- too large: wastes memory and database resources
- the default of 10 handles Neon cold start scenarios where multiple requests arrive after an idle period

`max_overflow`:
- provides burst capacity for traffic spikes
- the default of 5 allows 15 total connections under peak load
- connections beyond `pool_size` are closed when returned (not kept idle)

`pool_recycle`:
- prevents stale connections from lingering
- should be shorter than any server-side idle timeout that closes connections
- 2 hours is a safe default for most PostgreSQL configurations

`pool_pre_ping`:
- adds a small overhead (`SELECT 1`) before each connection use
- prevents using connections that were closed by the database
- recommended for production to avoid connection errors
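to make the mapping concrete, here is a minimal sketch of how these settings might be passed to SQLAlchemy's async engine; the `DATABASE_URL` variable and the exact wiring are illustrative (the real setup lives in `src/backend/utilities/database.py`):

```python
import os

from sqlalchemy.ext.asyncio import create_async_engine

# illustrative sketch only - the actual wiring lives in src/backend/utilities/database.py
connection_timeout = float(os.getenv("DATABASE_CONNECTION_TIMEOUT", "10.0"))

engine = create_async_engine(
    os.environ["DATABASE_URL"],  # hypothetical env var holding the Neon DSN
    pool_size=int(os.getenv("DATABASE_POOL_SIZE", "10")),
    max_overflow=int(os.getenv("DATABASE_MAX_OVERFLOW", "5")),
    pool_recycle=int(os.getenv("DATABASE_POOL_RECYCLE", "7200")),
    pool_pre_ping=os.getenv("DATABASE_POOL_PRE_PING", "true") == "true",
    pool_timeout=connection_timeout,  # pool_timeout mirrors the connection timeout
    pool_use_lifo=True,               # see "LIFO pool strategy" below
)
```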
## Neon serverless considerations
plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces cold start latency that affects connection pooling:
### the cold start problem
- site is idle for several minutes → Neon scales down
- first request arrives → Neon needs 5-10s to wake up
- if pool_size is too small, all connections hang waiting for Neon
- new requests can't get connections → 500 errors
### how we mitigate this

**larger connection pool (`pool_size=10`, `max_overflow=5`):**
- allows 15 concurrent requests to wait for Neon wake-up
- prevents pool exhaustion during cold start

**appropriate connection timeout (10s):**
- long enough to wait for Neon cold start (~5-10s)
- short enough to fail fast on true database outages

**queue listener heartbeat:**
- background task pings the database every 5s
- detects connection death before user requests fail
- triggers reconnection with exponential backoff (see the sketch below)
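a rough sketch of that heartbeat pattern follows; the 5s interval and exponential backoff come from this doc, while the function shape, error handling, and the 15s connect timeout (from the incident notes below) are illustrative rather than the actual queue service code.

```python
import asyncio

import asyncpg

async def heartbeat(dsn: str) -> None:
    """ping the database every 5s; reconnect with exponential backoff on failure."""
    conn: asyncpg.Connection | None = None
    backoff = 1.0
    while True:
        try:
            if conn is None or conn.is_closed():
                # bounded connect so a slow or waking database can't hang the task
                conn = await asyncpg.connect(dsn, timeout=15)
            await conn.execute("SELECT 1")
            backoff = 1.0           # healthy: reset backoff
            await asyncio.sleep(5)  # heartbeat interval from this doc
        except (OSError, asyncio.TimeoutError, asyncpg.PostgresError):
            if conn is not None:
                conn.terminate()    # drop the broken connection without waiting
                conn = None
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60)  # exponential backoff, capped
```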
## incident history

- 2025-11-17: first pool exhaustion outage - the queue listener hung indefinitely on a slow database. fix: added a 15s timeout to `asyncpg.connect()` in the queue service.
- 2025-12-02: cold start recurrence - a 5 minute idle period caused Neon to scale down. the first 5 requests after wake-up hung for 3-5 minutes each, exhausting the pool. fix: increased `pool_size` to 10, `max_overflow` to 5, and `connection_timeout` to 10s.
## production best practices

### current deployment (Neon serverless)

```bash
# pool sized for cold start scenarios
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=5

# timeout accounts for Neon wake-up latency
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=10.0

# standard recycle
DATABASE_POOL_RECYCLE=7200
```
this configuration:
- handles 15 concurrent requests during Neon cold start
- fails fast (10s) on true database issues
- balances resource usage with reliability
### if seeing pool exhaustion

**option 1: increase pool size**

```bash
DATABASE_POOL_SIZE=15
DATABASE_MAX_OVERFLOW=5
```

- pros: more concurrent capacity during cold starts
- cons: more database connections when warm

**option 2: increase overflow**

```bash
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=10  # allows 20 total under burst
```

- pros: higher burst capacity, same baseline
- cons: less predictable peak resource usage
### tuning statement timeout

adjust based on your query patterns:

```bash
# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0
```
analyze slow query logs to understand p99 query latency, then set timeout with appropriate headroom.
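as a tiny worked example of that last step, the helper below derives a timeout from observed latencies; the function name and the 2x headroom factor are made up for illustration.

```python
import math

def suggest_statement_timeout(latencies_s: list[float], headroom: float = 2.0) -> float:
    """suggest a DATABASE_STATEMENT_TIMEOUT from observed query latencies (seconds)."""
    ordered = sorted(latencies_s)
    # index of the p99 sample, clamped for small sample sizes
    idx = min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)
    return round(ordered[idx] * headroom, 1)

# e.g. a p99 of ~1.2s with 2x headroom suggests DATABASE_STATEMENT_TIMEOUT=2.4
```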
## monitoring

### what to watch
- **pool utilization**: `engine.pool.checkedout() / pool_size`
  - alert if > 80% for sustained periods (see the sampling sketch after this list)
  - indicates a need to increase `pool_size`
- **connection timeouts**: rate of pool timeout errors
  - indicates pool exhaustion
  - consider increasing `pool_size` or `max_overflow`
- **statement timeouts**: rate of query timeout errors
  - indicates slow queries or an inappropriate timeout
  - investigate slow queries, consider adjusting the timeout
- **connection errors**: failed connection attempts
  - database availability issues
  - network problems
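a small sketch of how pool utilization could be sampled in-process, using the same `engine.pool` accessors as the troubleshooting section below; the logger name and alert threshold are illustrative.

```python
import logging

logger = logging.getLogger("db.pool")

def log_pool_utilization(engine, pool_size: int = 10) -> None:
    """log current pool usage and warn when utilization exceeds 80%."""
    checked_out = engine.pool.checkedout()  # connections currently in use
    utilization = checked_out / pool_size
    logger.info("pool checked_out=%d/%d (%.0f%%)", checked_out, pool_size, utilization * 100)
    if utilization > 0.8:
        logger.warning("pool utilization above 80%; consider raising DATABASE_POOL_SIZE")
```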
### observability
the engine is instrumented with logfire when observability is enabled (LOGFIRE_ENABLED=true). this provides:
- connection pool metrics
- query execution times
- timeout events
- connection errors
review logfire traces to understand:
- which queries are slow
- when pool exhaustion occurs
- connection patterns under load
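a minimal sketch of what that instrumentation hook might look like, assuming logfire's SQLAlchemy integration (`logfire.instrument_sqlalchemy`); the actual setup in the backend may differ, and the URL is a placeholder.

```python
import os

import logfire
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine("postgresql+asyncpg://...")  # placeholder URL

if os.getenv("LOGFIRE_ENABLED", "false") == "true":
    logfire.configure()
    # async engines expose the underlying sync engine for instrumentation
    logfire.instrument_sqlalchemy(engine=engine.sync_engine)
```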
## how it works

### connection lifecycle
- request arrives → acquire connection from pool
- execute query → use connection (with statement_timeout)
- request complete → return connection to pool
- recycle threshold → close and replace old connections
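in code, that lifecycle corresponds to the standard SQLAlchemy async session pattern; this is a sketch (the URL is a placeholder and the pool settings are the ones described above), not the project's actual session dependency.

```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

engine = create_async_engine("postgresql+asyncpg://...")  # placeholder; pool config as above
session_factory = async_sessionmaker(engine, expire_on_commit=False)

async def handle_request() -> int:
    # entering the block borrows a pooled connection on first use;
    # leaving it returns the connection to the pool, even on errors
    async with session_factory() as session:
        result = await session.execute(text("SELECT 1"))  # runs under statement_timeout
        return result.scalar_one()
```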
### failure modes

**pool exhausted:**
- all `pool_size + max_overflow` connections in use
- new requests wait for `pool_timeout` seconds
- timeout → 503 error (fail fast; see the handler sketch after this section)

**database slow:**
- connection attempt exceeds `connection_timeout`
- timeout → fail fast, can retry

**query too slow:**
- query exceeds `statement_timeout`
- query killed, connection returned to pool
- prevents one slow query from blocking others

**connection died:**
- `pool_pre_ping` detects the dead connection
- connection discarded, new one created
- prevents errors from using stale connections
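one way the "timeout → 503" behavior could be surfaced is an exception handler that maps pool timeouts to a 503 response; this assumes a FastAPI-style app and is a sketch, not necessarily how the plyr.fm API does it.

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.exc import TimeoutError as PoolTimeoutError  # raised when pool_timeout elapses

app = FastAPI()

@app.exception_handler(PoolTimeoutError)
async def pool_exhausted_handler(request: Request, exc: PoolTimeoutError) -> JSONResponse:
    # fail fast: tell the client the database is busy instead of hanging
    return JSONResponse(status_code=503, content={"detail": "database connection pool exhausted"})
```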
## troubleshooting

### 503 errors (pool exhausted)

**symptoms:**
- `QueuePool limit of size N overflow M reached` errors
- requests timing out waiting for connections
**diagnosis:**

```python
# check pool status
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")
print(f"pool size: {engine.pool.size()}")
```
**solutions:**
- increase `DATABASE_POOL_SIZE`
- add `DATABASE_MAX_OVERFLOW` for burst capacity
- investigate slow queries holding connections
- check for connection leaks (connections not being returned)
### connection timeouts

**symptoms:**
- `asyncpg.exceptions.ConnectionTimeoutError`
- requests failing to connect to the database

**diagnosis:**
- check database availability
- check network latency
- review logfire connection traces

**solutions:**
- verify the database is responsive
- increase `DATABASE_CONNECTION_TIMEOUT` if network latency is high
- investigate database performance (CPU, I/O)
### statement timeouts

**symptoms:**
- `asyncpg.exceptions.QueryCanceledError`
- specific queries timing out

**diagnosis:**
- identify slow queries in logs
- check query execution plans
- review database indexes

**solutions:**
- optimize slow queries
- add database indexes
- increase `DATABASE_STATEMENT_TIMEOUT` if queries are legitimately slow
- consider background processing for long operations
## implementation details

### asyncpg-specific configuration

when using `postgresql+asyncpg://` URLs, additional connection settings are applied:
```python
import uuid

connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",
    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,
    # statement timeout
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,
    # connection timeout
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}
```
these settings prevent prepared statement caching issues in asyncpg while maintaining performance.
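for completeness, these connect_args would typically be handed to the engine factory alongside the pool settings described earlier; a sketch (the URL is a placeholder):

```python
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://...",  # placeholder URL
    connect_args=connect_args,   # the asyncpg settings shown above
    # pool_size, max_overflow, pool_timeout, etc. as described earlier
)
```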
### LIFO pool strategy

connections are checked out of the pool in LIFO (last-in, first-out) order, so the most recently returned connection is reused first:

```python
pool_use_lifo=True
```
**benefits:**
- recently used connections are more likely to still be valid
- helps shed excess connections after load spikes
- increases likelihood of connection cache hits
## references

- implementation: `src/backend/utilities/database.py`
- settings: `src/backend/config.py`
- sqlalchemy pooling: https://docs.sqlalchemy.org/en/20/core/pooling.html
- asyncpg connections: https://magicstack.github.io/asyncpg/current/api/index.html#connection