database connection pooling#

configuration and best practices for managing database connections in production.

overview#

plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:

  • preventing connection exhaustion under load
  • failing fast when the database is slow or unresponsive
  • optimizing resource usage across concurrent requests

configuration#

all settings are configurable via environment variables and defined in src/backend/config.py.
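
for orientation, a hypothetical sketch of how such a settings module could look (pydantic-settings and every field name here are assumptions, not the actual contents of config.py):

from pydantic_settings import BaseSettings

class DatabaseSettings(BaseSettings):
    # hypothetical field names mirroring the environment variables documented below;
    # pydantic-settings matches env vars to field names case-insensitively
    database_statement_timeout: float = 10.0
    database_connection_timeout: float = 10.0
    database_pool_size: int = 10
    database_max_overflow: int = 5
    database_pool_recycle: int = 7200
    database_pool_pre_ping: bool = True

settings = DatabaseSettings()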

timeouts#

# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 10s)
# set higher than Neon cold start latency (~5-10s) to allow wake-up
DATABASE_CONNECTION_TIMEOUT=10.0

# how long to wait for an available connection from the pool (default: same as connection_timeout)
# there is no separate variable: this is set automatically to match DATABASE_CONNECTION_TIMEOUT

why these matter:

  • statement_timeout: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
  • connection_timeout: fails fast when the database is slow or unreachable. set higher than Neon cold start latency (5-10s) to allow serverless databases to wake up after idle periods.
  • pool_timeout: fails fast when all connections are busy. without this, requests wait forever when the pool is exhausted.
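
a minimal sketch of how these timeouts map onto engine arguments with the asyncpg driver (default values shown literally; the full connect_args are covered under implementation details below):

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@host/db",  # placeholder URL
    # pool_timeout mirrors DATABASE_CONNECTION_TIMEOUT: fail fast when every pooled connection is busy
    pool_timeout=10.0,
    connect_args={
        "command_timeout": 10.0,  # DATABASE_STATEMENT_TIMEOUT: per-query limit
        "timeout": 10.0,          # DATABASE_CONNECTION_TIMEOUT: connect limit, covers Neon wake-up
    },
)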

connection pool sizing#

# number of persistent connections to maintain (default: 10)
DATABASE_POOL_SIZE=10

# additional connections to create on demand when pool is exhausted (default: 5)
DATABASE_MAX_OVERFLOW=5

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true

sizing considerations:

total max connections = pool_size + max_overflow = 15 by default

pool_size:

  • too small: connection contention, requests wait for available connections
  • too large: wastes memory and database resources
  • default of 10 handles Neon cold start scenarios where multiple requests arrive after idle periods

max_overflow:

  • provides burst capacity for traffic spikes
  • default of 5 allows 15 total connections under peak load
  • connections beyond pool_size are closed when returned (not kept idle)

pool_recycle:

  • prevents stale connections from lingering
  • should be less than your database's connection timeout
  • 2 hours is a safe default for most PostgreSQL configurations

pool_pre_ping:

  • adds small overhead (SELECT 1) before each connection use
  • prevents using connections that were closed by the database
  • recommended for production to avoid connection errors
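
for reference, a sketch of how these sizing knobs map onto create_async_engine keyword arguments (illustrative; the real engine setup in the codebase may differ), alongside the timeout arguments shown earlier:

from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@host/db",  # placeholder URL
    pool_size=10,        # DATABASE_POOL_SIZE: persistent connections kept open
    max_overflow=5,      # DATABASE_MAX_OVERFLOW: burst connections, closed when returned
    pool_recycle=7200,   # DATABASE_POOL_RECYCLE: replace connections older than 2 hours
    pool_pre_ping=True,  # DATABASE_POOL_PRE_PING: health check before each checkout
    pool_use_lifo=True,  # see "LIFO pool strategy" under implementation details
)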

Neon serverless considerations#

plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces cold start latency that affects connection pooling:

the cold start problem#

  1. site is idle for several minutes → Neon scales down
  2. first request arrives → Neon needs 5-10s to wake up
  3. if pool_size is too small, all connections hang waiting for Neon
  4. new requests can't get connections → 500 errors

how we mitigate this#

larger connection pool (pool_size=10, max_overflow=5):

  • allows 15 concurrent requests to wait for Neon wake-up
  • prevents pool exhaustion during cold start

appropriate connection timeout (10s):

  • long enough to wait for Neon cold start (~5-10s)
  • short enough to fail fast on true database outages

queue listener heartbeat:

  • background task pings database every 5s
  • detects connection death before user requests fail
  • triggers reconnection with exponential backoff
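
the real heartbeat lives in the queue service; an illustrative standalone version (the 15s connect timeout echoes the fix in the incident history below, everything else is a sketch) might look like:

import asyncio

import asyncpg

async def heartbeat(dsn: str, interval: float = 5.0) -> None:
    # ping the database every `interval` seconds; on failure, drop the
    # connection and reconnect with exponential backoff
    conn: asyncpg.Connection | None = None
    backoff = 1.0
    while True:
        try:
            if conn is None or conn.is_closed():
                # bounded connect timeout so a slow database can't hang the listener
                conn = await asyncpg.connect(dsn, timeout=15)
            await conn.fetchval("SELECT 1", timeout=interval)
            backoff = 1.0
            await asyncio.sleep(interval)
        except (OSError, asyncio.TimeoutError, asyncpg.PostgresError):
            # connection died or the database is unresponsive: back off and retry
            if conn is not None and not conn.is_closed():
                await conn.close()
            conn = None
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60.0)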

incident history#

  • 2025-11-17: first pool exhaustion outage - queue listener hung indefinitely on slow database. fix: added 15s timeout to asyncpg.connect() in queue service.
  • 2025-12-02: cold start recurrence - 5 minute idle period caused Neon to scale down. first 5 requests after wake-up hung for 3-5 minutes each, exhausting pool. fix: increased pool_size to 10, max_overflow to 5, connection_timeout to 10s.

production best practices#

current deployment (Neon serverless)#

# pool sized for cold start scenarios
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=5

# timeout accounts for Neon wake-up latency
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=10.0

# standard recycle
DATABASE_POOL_RECYCLE=7200

this configuration:

  • handles 15 concurrent requests during Neon cold start
  • fails fast (10s) on true database issues
  • balances resource usage with reliability

if seeing pool exhaustion#

option 1: increase pool size

DATABASE_POOL_SIZE=15
DATABASE_MAX_OVERFLOW=5

pros: more concurrent capacity during cold starts
cons: more database connections when warm

option 2: increase overflow

DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=10  # allows 20 total under burst

pros: higher burst capacity, same baseline
cons: less predictable peak resource usage

tuning statement timeout#

adjust based on your query patterns:

# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0

analyze slow query logs to understand p99 query latency, then set timeout with appropriate headroom.
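
one way to get that picture is pg_stat_statements (assuming the extension is enabled; it reports mean/max execution time rather than a true p99):

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncEngine

# column names are for PostgreSQL 13+ (mean_exec_time / max_exec_time)
SLOW_QUERIES = text("""
    select query, calls, mean_exec_time, max_exec_time
    from pg_stat_statements
    order by mean_exec_time desc
    limit 20
""")

async def report_slow_queries(engine: AsyncEngine) -> None:
    # print the slowest statements so the timeout can be chosen with headroom
    async with engine.connect() as conn:
        result = await conn.execute(SLOW_QUERIES)
        for row in result:
            print(f"{row.mean_exec_time:9.1f} ms avg | {row.calls:7d} calls | {row.query[:80]}")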

monitoring#

what to watch#

  1. pool utilization: engine.pool.checkedout() / pool_size

    • alert if > 80% for sustained periods
    • indicates need to increase pool_size
  2. connection timeouts: rate of pool timeout errors

    • indicates pool exhaustion
    • consider increasing pool_size or max_overflow
  3. statement timeouts: rate of query timeout errors

    • indicates slow queries or inappropriate timeout
    • investigate slow queries, consider adjusting timeout
  4. connection errors: failed connection attempts

    • database availability issues
    • network problems
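
a small helper along these lines can surface the first two signals (illustrative; real alerting would go through your metrics pipeline rather than logs):

import asyncio
import logging

from sqlalchemy.ext.asyncio import AsyncEngine

logger = logging.getLogger(__name__)

async def log_pool_stats(engine: AsyncEngine, interval: float = 30.0) -> None:
    # periodically log pool utilization; alert if it stays above ~80% of pool_size
    while True:
        pool = engine.pool
        checked_out = pool.checkedout()
        utilization = checked_out / max(pool.size(), 1)
        logger.info(
            "db pool: checked_out=%d idle=%d size=%d utilization=%.0f%%",
            checked_out, pool.checkedin(), pool.size(), utilization * 100,
        )
        await asyncio.sleep(interval)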

observability#

the engine is instrumented with logfire when observability is enabled (LOGFIRE_ENABLED=true). this provides:

  • connection pool metrics
  • query execution times
  • timeout events
  • connection errors

review logfire traces to understand:

  • which queries are slow
  • when pool exhaustion occurs
  • connection patterns under load
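
the exact wiring lives in the backend setup code; a hedged sketch of what instrumentation could look like with the logfire SDK (the helper, the env-var gate, and the sync_engine detail are all assumptions):

import os

import logfire
from sqlalchemy.ext.asyncio import AsyncEngine

def setup_observability(engine: AsyncEngine) -> None:
    # skip instrumentation unless observability is enabled
    if os.environ.get("LOGFIRE_ENABLED", "").lower() != "true":
        return
    logfire.configure()
    # the SQLAlchemy instrumentation hooks the sync facade of the async engine
    logfire.instrument_sqlalchemy(engine=engine.sync_engine)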

how it works#

connection lifecycle#

  1. request arrives → acquire connection from pool
  2. execute query → use connection (with statement_timeout)
  3. request complete → return connection to pool
  4. recycle threshold → close and replace old connections
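
in request-handler terms, the lifecycle is a context manager around the pooled engine; a sketch (the project presumably has its own session dependency):

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncEngine, async_sessionmaker

async def handle_request(engine: AsyncEngine) -> int:
    session_factory = async_sessionmaker(engine, expire_on_commit=False)
    # entering the block checks a connection out of the pool
    async with session_factory() as session:
        result = await session.execute(text("select 1"))  # runs under statement_timeout
        value = result.scalar_one()
    # leaving the block returns the connection to the pool
    # (overflow connections are closed instead of being kept idle)
    return value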

failure modes#

pool exhausted:

  • all pool_size + max_overflow connections in use
  • new requests wait for pool_timeout seconds
  • timeout → 503 error (fail fast)

database slow:

  • connection attempt exceeds connection_timeout
  • timeout → fail fast, can retry

query too slow:

  • query exceeds statement_timeout
  • query killed, connection returned to pool
  • prevents one slow query from blocking others

connection died:

  • pool_pre_ping detects dead connection
  • connection discarded, new one created
  • prevents errors from using stale connections
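
to make the fail-fast 503 path concrete, a sketch assuming a FastAPI-style app (the framework and handler are assumptions, not the project's actual error handling):

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

app = FastAPI()

@app.exception_handler(PoolTimeoutError)
async def pool_exhausted_handler(request: Request, exc: PoolTimeoutError) -> JSONResponse:
    # sqlalchemy.exc.TimeoutError is raised when no connection becomes available
    # within pool_timeout; surface it as a fast 503 rather than a generic 500
    return JSONResponse(status_code=503, content={"detail": "database busy, try again shortly"})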

troubleshooting#

503 errors (pool exhausted)#

symptoms:

  • QueuePool limit of size N overflow M reached
  • requests timing out waiting for connections

diagnosis:

# check pool status (get_engine() returns the application's shared engine)
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")  # connections currently in use
print(f"checked in: {engine.pool.checkedin()}")    # idle connections waiting in the pool
print(f"pool size: {engine.pool.size()}")          # configured pool_size

solutions:

  1. increase DATABASE_POOL_SIZE
  2. add DATABASE_MAX_OVERFLOW for burst capacity
  3. investigate slow queries holding connections
  4. check for connection leaks (connections not being returned)

connection timeouts#

symptoms:

  • asyncpg.exceptions.ConnectionTimeoutError
  • requests failing to connect to database

diagnosis:

  • check database availability
  • check network latency
  • review logfire connection traces

solutions:

  1. verify database is responsive
  2. increase DATABASE_CONNECTION_TIMEOUT if network latency is high
  3. investigate database performance (CPU, I/O)

statement timeouts#

symptoms:

  • asyncpg.exceptions.QueryCanceledError
  • specific queries timing out

diagnosis:

  • identify slow queries in logs
  • check query execution plans
  • review database indexes

solutions:

  1. optimize slow queries
  2. add database indexes
  3. increase DATABASE_STATEMENT_TIMEOUT if queries are legitimately slow
  4. consider background processing for long operations

implementation details#

asyncpg-specific configuration#

when using postgresql+asyncpg:// URLs, additional connection settings are applied:

import uuid

connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",

    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,

    # per-query limit (DATABASE_STATEMENT_TIMEOUT)
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,

    # connection establishment limit (DATABASE_CONNECTION_TIMEOUT)
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}

these settings prevent prepared statement caching issues in asyncpg while maintaining performance.

LIFO pool strategy#

connections are returned in LIFO (last-in, first-out) order:

pool_use_lifo=True

benefits:

  • recently used connections are more likely to still be valid
  • helps shed excess connections after load spikes
  • increases likelihood of connection cache hits

references#