database connection pooling#

configuration and best practices for managing database connections in production.

overview#

plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:

  • preventing connection exhaustion under load
  • failing fast when the database is slow or unresponsive
  • optimizing resource usage across concurrent requests

configuration#

all settings are configurable via environment variables and defined in src/backend/config.py.
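
for orientation, a hypothetical sketch of how such a settings module could look (pydantic-settings and every field name here are assumptions, not the actual contents of config.py):

from pydantic_settings import BaseSettings

class DatabaseSettings(BaseSettings):
    # hypothetical field names mirroring the environment variables documented below;
    # pydantic-settings matches env vars to field names case-insensitively
    database_statement_timeout: float = 10.0
    database_connection_timeout: float = 10.0
    database_pool_size: int = 10
    database_max_overflow: int = 5
    database_pool_recycle: int = 7200
    database_pool_pre_ping: bool = True

settings = DatabaseSettings()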

timeouts#

# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 10s)
# set higher than Neon cold start latency (~5-10s) to allow wake-up
DATABASE_CONNECTION_TIMEOUT=10.0

# how long to wait for an available connection from the pool (default: same as connection_timeout)
# there is no separate variable: this is set automatically to match DATABASE_CONNECTION_TIMEOUT

why these matter:

  • statement_timeout: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
  • connection_timeout: fails fast when the database is slow or unreachable. set higher than Neon cold start latency (5-10s) to allow serverless databases to wake up after idle periods.
  • pool_timeout: fails fast when all connections are busy. without this, requests wait forever when the pool is exhausted.
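
a minimal sketch of how these timeouts map onto engine arguments with the asyncpg driver (default values shown literally; the full connect_args are covered under implementation details below):

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@host/db",  # placeholder URL
    # pool_timeout mirrors DATABASE_CONNECTION_TIMEOUT: fail fast when every pooled connection is busy
    pool_timeout=10.0,
    connect_args={
        "command_timeout": 10.0,  # DATABASE_STATEMENT_TIMEOUT: per-query limit
        "timeout": 10.0,          # DATABASE_CONNECTION_TIMEOUT: connect limit, covers Neon wake-up
    },
)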

connection pool sizing#

# number of persistent connections to maintain (default: 10)
DATABASE_POOL_SIZE=10

# additional connections to create on demand when pool is exhausted (default: 5)
DATABASE_MAX_OVERFLOW=5

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true

sizing considerations:

total max connections = pool_size + max_overflow = 15 by default

pool_size:

  • too small: connection contention, requests wait for available connections
  • too large: wastes memory and database resources
  • default of 10 handles Neon cold start scenarios where multiple requests arrive after idle periods

max_overflow:

  • provides burst capacity for traffic spikes
  • default of 5 allows 15 total connections under peak load
  • connections beyond pool_size are closed when returned (not kept idle)

pool_recycle:

  • prevents stale connections from lingering
  • should be less than your database's connection timeout
  • 2 hours is a safe default for most PostgreSQL configurations

pool_pre_ping:

  • adds small overhead (SELECT 1) before each connection use
  • prevents using connections that were closed by the database
  • recommended for production to avoid connection errors
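
for reference, a sketch of how these sizing knobs map onto create_async_engine keyword arguments (illustrative; the real engine setup in the codebase may differ), alongside the timeout arguments shown earlier:

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@host/db",  # placeholder URL
    pool_size=10,        # DATABASE_POOL_SIZE: persistent connections kept open
    max_overflow=5,      # DATABASE_MAX_OVERFLOW: burst connections, closed when returned
    pool_recycle=7200,   # DATABASE_POOL_RECYCLE: replace connections older than 2 hours
    pool_pre_ping=True,  # DATABASE_POOL_PRE_PING: health check before each checkout
    pool_use_lifo=True,  # see "LIFO pool strategy" under implementation details
)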

Neon serverless considerations#

plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces cold start latency that affects connection pooling:

the cold start problem#

  1. site is idle for several minutes → Neon scales down
  2. first request arrives → Neon needs 5-10s to wake up
  3. if pool_size is too small, all connections hang waiting for Neon
  4. new requests can't get connections → 500 errors

how we mitigate this#

larger connection pool (pool_size=10, max_overflow=5):

  • allows 15 concurrent requests to wait for Neon wake-up
  • prevents pool exhaustion during cold start

appropriate connection timeout (10s):

  • long enough to wait for Neon cold start (~5-10s)
  • short enough to fail fast on true database outages

queue listener heartbeat:

  • background task pings database every 5s
  • detects connection death before user requests fail
  • triggers reconnection with exponential backoff
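
the real heartbeat lives in the queue service; an illustrative standalone version (the 15s connect timeout echoes the fix in the incident history below, everything else is a sketch) might look like:

import asyncio

import asyncpg

async def heartbeat(dsn: str, interval: float = 5.0) -> None:
    # ping the database every `interval` seconds; on failure, drop the
    # connection and reconnect with exponential backoff
    conn: asyncpg.Connection | None = None
    backoff = 1.0
    while True:
        try:
            if conn is None or conn.is_closed():
                # bounded connect timeout so a slow database can't hang the listener
                conn = await asyncpg.connect(dsn, timeout=15)
            await conn.fetchval("SELECT 1", timeout=interval)
            backoff = 1.0
            await asyncio.sleep(interval)
        except (OSError, asyncio.TimeoutError, asyncpg.PostgresError):
            # connection died or the database is unresponsive: back off and retry
            if conn is not None and not conn.is_closed():
                await conn.close()
            conn = None
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60.0)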

incident history#

  • 2025-11-17: first pool exhaustion outage - queue listener hung indefinitely on slow database. fix: added 15s timeout to asyncpg.connect() in queue service.
  • 2025-12-02: cold start recurrence - 5 minute idle period caused Neon to scale down. first 5 requests after wake-up hung for 3-5 minutes each, exhausting pool. fix: increased pool_size to 10, max_overflow to 5, connection_timeout to 10s.

production best practices#

current deployment (Neon serverless)#

# pool sized for cold start scenarios
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=5

# timeout accounts for Neon wake-up latency
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=10.0

# standard recycle
DATABASE_POOL_RECYCLE=7200

this configuration:

  • handles 15 concurrent requests during Neon cold start
  • fails fast (10s) on true database issues
  • balances resource usage with reliability

if seeing pool exhaustion#

option 1: increase pool size

DATABASE_POOL_SIZE=15
DATABASE_MAX_OVERFLOW=5

pros: more concurrent capacity during cold starts
cons: more database connections when warm

option 2: increase overflow

DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=10  # allows 20 total under burst

pros: higher burst capacity, same baseline
cons: less predictable peak resource usage

tuning statement timeout#

adjust based on your query patterns:

# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0

analyze slow query logs to understand p99 query latency, then set timeout with appropriate headroom.
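
one way to get that picture is pg_stat_statements (assuming the extension is enabled; it reports mean/max execution time rather than a true p99):

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncEngine

# column names are for PostgreSQL 13+ (mean_exec_time / max_exec_time)
SLOW_QUERIES = text("""
    select query, calls, mean_exec_time, max_exec_time
    from pg_stat_statements
    order by mean_exec_time desc
    limit 20
""")

async def report_slow_queries(engine: AsyncEngine) -> None:
    # print the slowest statements so the timeout can be chosen with headroom
    async with engine.connect() as conn:
        result = await conn.execute(SLOW_QUERIES)
        for row in result:
            print(f"{row.mean_exec_time:9.1f} ms avg | {row.calls:7d} calls | {row.query[:80]}")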

monitoring#

what to watch#

  1. pool utilization: engine.pool.checkedout() / pool_size

    • alert if > 80% for sustained periods
    • indicates need to increase pool_size
  2. connection timeouts: rate of pool timeout errors

    • indicates pool exhaustion
    • consider increasing pool_size or max_overflow
  3. statement timeouts: rate of query timeout errors

    • indicates slow queries or inappropriate timeout
    • investigate slow queries, consider adjusting timeout
  4. connection errors: failed connection attempts

    • database availability issues
    • network problems
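
a small helper along these lines can surface the first two signals (illustrative; real alerting would go through your metrics pipeline rather than logs):

import asyncio
import logging

from sqlalchemy.ext.asyncio import AsyncEngine

logger = logging.getLogger(__name__)

async def log_pool_stats(engine: AsyncEngine, interval: float = 30.0) -> None:
    # periodically log pool utilization; alert if it stays above ~80% of pool_size
    while True:
        pool = engine.pool
        checked_out = pool.checkedout()
        utilization = checked_out / max(pool.size(), 1)
        logger.info(
            "db pool: checked_out=%d idle=%d size=%d utilization=%.0f%%",
            checked_out, pool.checkedin(), pool.size(), utilization * 100,
        )
        await asyncio.sleep(interval)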

observability#

the engine is instrumented with logfire when observability is enabled (LOGFIRE_ENABLED=true). this provides:

  • connection pool metrics
  • query execution times
  • timeout events
  • connection errors

review logfire traces to understand:

  • which queries are slow
  • when pool exhaustion occurs
  • connection patterns under load
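
the exact wiring lives in the backend setup code; a hedged sketch of what instrumentation could look like with the logfire SDK (the helper, the env-var gate, and the sync_engine detail are all assumptions):

import os

import logfire
from sqlalchemy.ext.asyncio import AsyncEngine

def setup_observability(engine: AsyncEngine) -> None:
    # skip instrumentation unless observability is enabled
    if os.environ.get("LOGFIRE_ENABLED", "").lower() != "true":
        return
    logfire.configure()
    # the SQLAlchemy instrumentation hooks the sync facade of the async engine
    logfire.instrument_sqlalchemy(engine=engine.sync_engine)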

how it works#

connection lifecycle#

  1. request arrives → acquire connection from pool
  2. execute query → use connection (with statement_timeout)
  3. request complete → return connection to pool
  4. recycle threshold → close and replace old connections
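
in request-handler terms, the lifecycle is a context manager around the pooled engine; a sketch (the project presumably has its own session dependency):

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncEngine, async_sessionmaker

async def handle_request(engine: AsyncEngine) -> int:
    session_factory = async_sessionmaker(engine, expire_on_commit=False)
    # entering the block checks a connection out of the pool
    async with session_factory() as session:
        result = await session.execute(text("select 1"))  # runs under statement_timeout
        value = result.scalar_one()
    # leaving the block returns the connection to the pool
    # (overflow connections are closed instead of being kept idle)
    return value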

failure modes#

pool exhausted:

  • all pool_size + max_overflow connections in use
  • new requests wait for pool_timeout seconds
  • timeout → 503 error (fail fast)

database slow:

  • connection attempt exceeds connection_timeout
  • timeout → fail fast, can retry

query too slow:

  • query exceeds statement_timeout
  • query killed, connection returned to pool
  • prevents one slow query from blocking others

connection died:

  • pool_pre_ping detects dead connection
  • connection discarded, new one created
  • prevents errors from using stale connections
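
to make the fail-fast 503 path concrete, a sketch assuming a FastAPI-style app (the framework and handler are assumptions, not the project's actual error handling):

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

app = FastAPI()

@app.exception_handler(PoolTimeoutError)
async def pool_exhausted_handler(request: Request, exc: PoolTimeoutError) -> JSONResponse:
    # sqlalchemy.exc.TimeoutError is raised when no connection becomes available
    # within pool_timeout; surface it as a fast 503 rather than a generic 500
    return JSONResponse(status_code=503, content={"detail": "database busy, try again shortly"})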

troubleshooting#

503 errors (pool exhausted)#

symptoms:

  • QueuePool limit of size N overflow M reached
  • requests timing out waiting for connections

diagnosis:

# check pool status (get_engine() returns the application's shared engine)
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")  # connections currently in use
print(f"checked in: {engine.pool.checkedin()}")    # idle connections waiting in the pool
print(f"pool size: {engine.pool.size()}")          # configured pool_size

solutions:

  1. increase DATABASE_POOL_SIZE
  2. add DATABASE_MAX_OVERFLOW for burst capacity
  3. investigate slow queries holding connections
  4. check for connection leaks (connections not being returned)

connection timeouts#

symptoms:

  • asyncpg.exceptions.ConnectionTimeoutError
  • requests failing to connect to database

diagnosis:

  • check database availability
  • check network latency
  • review logfire connection traces

solutions:

  1. verify database is responsive
  2. increase DATABASE_CONNECTION_TIMEOUT if network latency is high
  3. investigate database performance (CPU, I/O)

statement timeouts#

symptoms:

  • asyncpg.exceptions.QueryCanceledError
  • specific queries timing out

diagnosis:

  • identify slow queries in logs
  • check query execution plans
  • review database indexes

solutions:

  1. optimize slow queries
  2. add database indexes
  3. increase DATABASE_STATEMENT_TIMEOUT if queries are legitimately slow
  4. consider background processing for long operations

implementation details#

asyncpg-specific configuration#

when using postgresql+asyncpg:// URLs, additional connection settings are applied:

import uuid

connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",

    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,

    # per-query limit (DATABASE_STATEMENT_TIMEOUT)
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,

    # connection establishment limit (DATABASE_CONNECTION_TIMEOUT)
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}

these settings prevent prepared statement caching issues in asyncpg while maintaining performance.

LIFO pool strategy#

connections are returned in LIFO (last-in, first-out) order:

pool_use_lifo=True

benefits:

  • recently used connections are more likely to still be valid
  • helps shed excess connections after load spikes
  • increases likelihood of connection cache hits

references#