fix: increase connection pool resilience for Neon cold starts (#422)

after 5 minutes idle, Neon scales down and cold start takes 5-10s.
first requests after idle would exhaust the pool (5 connections),
causing all subsequent requests to fail with 500 errors.

changes:
- pool_size: 5 → 10 (more concurrent cold start requests)
- max_overflow: 0 → 5 (burst capacity to 15 connections)
- connection_timeout: 3s → 10s (wait for Neon wake-up)

this is a recurrence of the Nov 17 incident. that fix addressed the
queue listener's asyncpg connection but not the SQLAlchemy pool.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>

authored by zzstoatzz.io and Claude, committed by GitHub d4b6d70e 124a1fc6

Changed files
+92 -40
+22 -1
STATUS.md
@@ -47,6 +47,27 @@

 ### December 2025

+#### connection pool resilience for Neon cold starts (Dec 2)
+
+**incident**: ~5 minute API outage (01:55-02:00 UTC) - all requests returned 500 errors
+
+**root cause**: Neon serverless cold start after 5 minutes of idle traffic
+- queue listener heartbeat detected dead connection, began reconnection
+- first 5 user requests each held a connection waiting for Neon to wake up (3-5 min each)
+- with pool_size=5 and max_overflow=0, pool exhausted immediately
+- all subsequent requests got `QueuePool limit of size 5 overflow 0 reached`
+
+**fix**:
+- increased `pool_size` from 5 → 10 (handle more concurrent cold start requests)
+- increased `max_overflow` from 0 → 5 (allow burst to 15 connections)
+- increased `connection_timeout` from 3s → 10s (wait for Neon wake-up)
+
+**related**: this is a recurrence of the Nov 17 incident. that fix addressed the queue listener's asyncpg connection but not the SQLAlchemy pool connections.
+
+**documentation**: updated `docs/backend/database/connection-pooling.md` with Neon serverless considerations and incident history.
+
+---
+
 #### now-playing API (PR #416, Dec 1)

 **motivation**: expose what users are currently listening to via public API
@@ -907,4 +928,4 @@

 ---

-this is a living document. last updated 2025-12-01.
+this is a living document. last updated 2025-12-02.
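the pool accounting behind the `QueuePool limit of size 5 overflow 0 reached` error can be sketched with a toy model. this is only an illustration, not SQLAlchemy's implementation: `ToyPool` and `fill` are hypothetical names, the toy rejects immediately instead of waiting for a pool timeout, and it raises the built-in `TimeoutError` with the message fragment quoted in the incident notes.

```python
class ToyPool:
    """toy model of pool accounting; not SQLAlchemy's implementation."""

    def __init__(self, pool_size: int, max_overflow: int) -> None:
        self.pool_size = pool_size
        self.max_overflow = max_overflow
        self.idle = 0         # open connections waiting in the pool
        self.checked_out = 0  # connections currently held by requests
        self.closed = 0       # overflow connections discarded on return

    def checkout(self) -> None:
        limit = self.pool_size + self.max_overflow
        if self.checked_out >= limit:
            # a real pool waits up to its pool timeout first; the toy fails at once
            raise TimeoutError(
                f"QueuePool limit of size {self.pool_size} "
                f"overflow {self.max_overflow} reached"
            )
        if self.idle:
            self.idle -= 1  # reuse an idle connection
        self.checked_out += 1

    def checkin(self) -> None:
        self.checked_out -= 1
        if self.idle < self.pool_size:
            self.idle += 1    # kept open for reuse
        else:
            self.closed += 1  # beyond pool_size: closed, not kept idle


def fill(pool: ToyPool, requests: int) -> int:
    """check out one connection per request; return how many were rejected."""
    rejected = 0
    for _ in range(requests):
        try:
            pool.checkout()
        except TimeoutError:
            rejected += 1
    return rejected
```

in this toy, 20 simultaneous requests that each hold their connection give `fill(ToyPool(5, 0), 20) == 15` rejections with the old defaults and `fill(ToyPool(10, 5), 20) == 5` with the new ones. it says nothing about how long each request waits, only how many can hold a connection at once.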
+7 -4
backend/src/backend/config.py
@@ -144,9 +144,9 @@
         description="Timeout in seconds for SQL statement execution. Prevents runaway queries from holding connections indefinitely.",
     )
     connection_timeout: float = Field(
-        default=3.0,
+        default=10.0,
         validation_alias="DATABASE_CONNECTION_TIMEOUT",
-        description="Timeout in seconds for establishing database connections. Fails fast when database is slow or unresponsive.",
+        description="Timeout in seconds for establishing database connections. Set higher than Neon cold start latency (~5-10s) to allow wake-up, but low enough to fail fast on true outages.",
     )
     queue_connect_timeout: float = Field(
         default=15.0,
@@ -155,13 +155,16 @@
     )

     # connection pool settings
+    # sized to handle Neon cold start scenarios where multiple requests arrive simultaneously
+    # after idle period. with pool_size=10 + max_overflow=5, we can handle 15 concurrent
+    # requests waiting for Neon to wake up without exhausting the pool.
     pool_size: int = Field(
-        default=5,
+        default=10,
         validation_alias="DATABASE_POOL_SIZE",
         description="Number of database connections to keep in the pool at all times.",
     )
     pool_max_overflow: int = Field(
-        default=0,
+        default=5,
         validation_alias="DATABASE_MAX_OVERFLOW",
         description="Maximum connections to create beyond pool_size when pool is exhausted. Total max connections = pool_size + pool_max_overflow.",
     )
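as a rough illustration of how these three defaults combine, here is a stdlib-only sketch. `PoolSettings`, `from_env`, `max_connections`, and `engine_kwargs` are hypothetical names, not the project's config class. the environment variable names come from the diff, and the keys returned by `engine_kwargs` are SQLAlchemy's `create_engine` pool arguments (`pool_size`, `max_overflow`, `pool_timeout`). how the project actually passes them to the engine is not shown in this change, so that part is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PoolSettings:
    """hypothetical stand-in for the pool-related fields in config.py."""

    pool_size: int = 10
    pool_max_overflow: int = 5
    connection_timeout: float = 10.0

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "PoolSettings":
        # read overrides from the environment, falling back to the new defaults
        env = os.environ if env is None else env
        return cls(
            pool_size=int(env.get("DATABASE_POOL_SIZE", 10)),
            pool_max_overflow=int(env.get("DATABASE_MAX_OVERFLOW", 5)),
            connection_timeout=float(env.get("DATABASE_CONNECTION_TIMEOUT", 10.0)),
        )

    @property
    def max_connections(self) -> int:
        # total max connections = pool_size + pool_max_overflow
        return self.pool_size + self.pool_max_overflow

    def engine_kwargs(self) -> dict:
        # assumption: the pool timeout mirrors connection_timeout,
        # as the connection-pooling docs describe
        return {
            "pool_size": self.pool_size,
            "max_overflow": self.pool_max_overflow,
            "pool_timeout": self.connection_timeout,
        }
```

with no overrides, `PoolSettings.from_env({}).max_connections` is 15, matching the "15 concurrent requests" figure in the new comment.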
+63 -35
docs/backend/database/connection-pooling.md
@@ -20,8 +20,9 @@
 # how long a single SQL query can run before being killed (default: 10s)
 DATABASE_STATEMENT_TIMEOUT=10.0

-# how long to wait when establishing a new database connection (default: 3s)
-DATABASE_CONNECTION_TIMEOUT=3.0
+# how long to wait when establishing a new database connection (default: 10s)
+# set higher than Neon cold start latency (~5-10s) to allow wake-up
+DATABASE_CONNECTION_TIMEOUT=10.0

 # how long to wait for an available connection from the pool (default: = connection_timeout)
 # this is automatically set to match DATABASE_CONNECTION_TIMEOUT
@@ -30,17 +31,17 @@
 **why these matter:**

 - **statement_timeout**: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
-- **connection_timeout**: fails fast when the database is slow or unreachable. prevents hanging indefinitely on connection attempts.
+- **connection_timeout**: fails fast when the database is slow or unreachable. set higher than Neon cold start latency (5-10s) to allow serverless databases to wake up after idle periods.
 - **pool_timeout**: fails fast when all connections are busy. without this, requests wait forever when the pool is exhausted.

 ### connection pool sizing

 ```bash
-# number of persistent connections to maintain (default: 5)
-DATABASE_POOL_SIZE=5
+# number of persistent connections to maintain (default: 10)
+DATABASE_POOL_SIZE=10

-# additional connections to create on demand when pool is exhausted (default: 0)
-DATABASE_MAX_OVERFLOW=0
+# additional connections to create on demand when pool is exhausted (default: 5)
+DATABASE_MAX_OVERFLOW=5

 # how long before recycling a connection, in seconds (default: 7200 = 2 hours)
 DATABASE_POOL_RECYCLE=7200
@@ -51,17 +52,17 @@

 **sizing considerations:**

-total max connections = `pool_size` + `max_overflow`
+total max connections = `pool_size` + `max_overflow` = 15 by default

 **pool_size:**
 - too small: connection contention, requests wait for available connections
 - too large: wastes memory and database resources
-- rule of thumb: start with 5, increase if seeing pool exhaustion
+- default of 10 handles Neon cold start scenarios where multiple requests arrive after idle periods

 **max_overflow:**
-- `0` (default): strict limit, fails fast when pool is full
-- `> 0`: creates additional connections on demand, provides burst capacity
-- tradeoff: graceful degradation vs predictable resource usage
+- provides burst capacity for traffic spikes
+- default of 5 allows 15 total connections under peak load
+- connections beyond pool_size are closed when returned (not kept idle)

 **pool_recycle:**
 - prevents stale connections from lingering
@@ -73,49 +74,76 @@
 - prevents using connections that were closed by the database
 - recommended for production to avoid connection errors

-## production best practices
+## Neon serverless considerations

-### small-scale (current deployment)
+plyr.fm uses Neon PostgreSQL, which scales to zero after periods of inactivity. this introduces **cold start latency** that affects connection pooling:

-for a single-instance deployment with moderate traffic:
+### the cold start problem
+
+1. site is idle for several minutes → Neon scales down
+2. first request arrives → Neon needs 5-10s to wake up
+3. if pool_size is too small, all connections hang waiting for Neon
+4. new requests can't get connections → 500 errors
+
+### how we mitigate this
+
+**larger connection pool (pool_size=10, max_overflow=5):**
+- allows 15 concurrent requests to wait for Neon wake-up
+- prevents pool exhaustion during cold start
+
+**appropriate connection timeout (10s):**
+- long enough to wait for Neon cold start (~5-10s)
+- short enough to fail fast on true database outages
+
+**queue listener heartbeat:**
+- background task pings database every 5s
+- detects connection death before user requests fail
+- triggers reconnection with exponential backoff
+
+### incident history
+
+- **2025-11-17**: first pool exhaustion outage - queue listener hung indefinitely on slow database. fix: added 15s timeout to asyncpg.connect() in queue service.
+- **2025-12-02**: cold start recurrence - 5 minute idle period caused Neon to scale down. first 5 requests after wake-up hung for 3-5 minutes each, exhausting pool. fix: increased pool_size to 10, max_overflow to 5, connection_timeout to 10s.
+
+## production best practices
+
+### current deployment (Neon serverless)

 ```bash
-# strict pool, fail fast
-DATABASE_POOL_SIZE=5
-DATABASE_MAX_OVERFLOW=0
+# pool sized for cold start scenarios
+DATABASE_POOL_SIZE=10
+DATABASE_MAX_OVERFLOW=5

-# conservative timeouts
+# timeout accounts for Neon wake-up latency
 DATABASE_STATEMENT_TIMEOUT=10.0
-DATABASE_CONNECTION_TIMEOUT=3.0
+DATABASE_CONNECTION_TIMEOUT=10.0

 # standard recycle
 DATABASE_POOL_RECYCLE=7200
 ```

 this configuration:
-- keeps resource usage predictable
-- fails fast under database issues
-- prevents cascading failures
+- handles 15 concurrent requests during Neon cold start
+- fails fast (10s) on true database issues
+- balances resource usage with reliability

-### scaling up
-
-if experiencing pool exhaustion (503 errors, connection timeouts):
+### if seeing pool exhaustion

 **option 1: increase pool size**
 ```bash
-DATABASE_POOL_SIZE=10
-DATABASE_MAX_OVERFLOW=0
+DATABASE_POOL_SIZE=15
+DATABASE_MAX_OVERFLOW=5
 ```
-pros: more concurrent capacity, still predictable
-cons: more memory/database connections
+pros: more concurrent capacity during cold starts
+cons: more database connections when warm

-**option 2: add overflow**
+**option 2: increase overflow**
 ```bash
-DATABASE_POOL_SIZE=5
-DATABASE_MAX_OVERFLOW=5 # allows 10 total under burst load
+DATABASE_POOL_SIZE=10
+DATABASE_MAX_OVERFLOW=10 # allows 20 total under burst
 ```
-pros: handles traffic spikes, efficient baseline
-cons: less predictable resource usage
+pros: higher burst capacity, same baseline
+cons: less predictable peak resource usage

 ### tuning statement timeout

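the docs diff mentions that the queue listener reconnects "with exponential backoff" but does not show the schedule. a minimal sketch of one capped schedule follows. the function name and the `base`, `factor`, and `cap` values are assumptions for illustration, not the values used by the queue service.

```python
import itertools
from typing import Iterator, List


def backoff_delays(
    base: float = 1.0, factor: float = 2.0, cap: float = 30.0
) -> Iterator[float]:
    """yield reconnect delays in seconds: base, base*factor, ... capped at cap."""
    delay = min(base, cap)
    while True:
        yield delay
        delay = min(delay * factor, cap)


def first_delays(n: int) -> List[float]:
    """return the first n delays of the default schedule."""
    return list(itertools.islice(backoff_delays(), n))
```

with these assumed defaults, `first_delays(6)` is `[1.0, 2.0, 4.0, 8.0, 16.0, 30.0]`. the cap keeps a long outage from pushing the retry interval far past the point where the database has recovered.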