Add configurable database connection pool settings (#281)

* add configurable database connection pool settings

adds comprehensive pool configuration based on production best practices:

timeouts:
- DATABASE_STATEMENT_TIMEOUT (10s) - kills runaway queries
- DATABASE_CONNECTION_TIMEOUT (3s) - fails fast on db issues
- pool_timeout (set equal to connection_timeout) - fails fast when pool exhausted

pool sizing:
- DATABASE_POOL_SIZE (5) - persistent connections
- DATABASE_MAX_OVERFLOW (0) - strict limit, no burst capacity
- DATABASE_POOL_RECYCLE (7200s) - recycle after 2 hours
- DATABASE_POOL_PRE_PING (true) - verify connection health

all settings are now configurable via env vars with sensible defaults.

also adds comprehensive documentation at docs/backend/database/connection-pooling.md covering:
- configuration reference
- production tuning guidance
- monitoring and troubleshooting
- scaling strategies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* add database performance dashboard for logfire

new comprehensive dashboard query showing:
- query performance grouped by type (SELECT, INSERT, UPDATE, DELETE) and table
- latency percentiles (avg, p50, p95, p99, max)
- error counts and rates
- sorted by impact (query count × avg duration)

includes a default view plus 3 alternative query modes:
1. default: aggregated metrics by query pattern and table
2. slowest queries: top 25 individual slow query instances
3. timeline: hourly query volume and performance trends
4. transactions: connection pool and transaction metrics

classifies queries by primary table (tracks, artists, albums, queue_state, etc)
to identify which parts of the schema are under most load.

useful for:
- identifying slow query patterns
- monitoring query volume by table
- tracking database error rates
- analyzing connection pool health
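
a condensed sketch of the impact ordering, adapted from the dashboard query against logfire's `records` table (the `LIKE '% neondb'` filter is shorthand here; the dashboard itself enumerates the exact span names):

```sql
-- rank query groups by total impact: volume × average latency
SELECT
    span_name,
    COUNT(*) AS total_queries,
    ROUND(AVG(duration * 1000)::numeric, 2) AS avg_duration_ms
FROM records
WHERE span_name LIKE '% neondb'  -- shorthand; dashboard matches exact names
GROUP BY span_name
ORDER BY (COUNT(*) * AVG(duration * 1000)) DESC
LIMIT 50;
```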

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix database dashboard to extract tables from actual SQL queries

the span_name is just 'SELECT neondb' / 'UPDATE neondb' etc.
the actual SQL query is in attributes->>'db.statement'.

updated dashboard to:
- extract table names from the actual SQL text in attributes
- distinguish between simple queries and JOINs (e.g. 'tracks' vs 'tracks+joins')
- use exact span_name matches instead of LIKE patterns
- show first 150 chars of actual query in slowest queries view

now the dashboard will properly classify queries by table instead of
showing everything as 'other/multiple'.
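
a condensed sketch of the new classification (two of the many CASE arms in the full dashboard query):

```sql
-- classify by the SQL text in attributes, not the span_name
SELECT
    CASE
        WHEN attributes->>'db.statement' LIKE '%FROM tracks%'
             AND attributes->>'db.statement' NOT LIKE '%JOIN%' THEN 'tracks'
        WHEN attributes->>'db.statement' LIKE '%FROM tracks JOIN%' THEN 'tracks+joins'
        ELSE 'other/unknown'
    END AS primary_table,
    COUNT(*) AS query_count
FROM records
WHERE span_name = 'SELECT neondb'  -- exact match, not LIKE
GROUP BY 1;
```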

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix query timeline to use per-minute buckets instead of hourly

hourly bucketing was creating artificial spikes at every hour mark
because all queries from 19:00-19:59 were grouped into a single point.

changes:
- use DATE_TRUNC('minute') instead of DATE_TRUNC('hour')
- reduce time window to 2 hours (less noise, better detail)
- rename to 'Time' instead of 'Hour' for clarity

this will create a much smoother time series chart showing
actual query patterns instead of hourly spike artifacts.
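
the change, condensed from the timeline query (SELECT spans only, for brevity):

```sql
-- per-minute buckets over a 2-hour window instead of hourly buckets
SELECT
    DATE_TRUNC('minute', start_timestamp) AS minute,
    COUNT(*) AS query_count,
    ROUND(AVG(duration * 1000)::numeric, 2) AS avg_duration_ms
FROM records
WHERE span_name = 'SELECT neondb'
  AND start_timestamp > NOW() - INTERVAL '2 hours'
GROUP BY minute
ORDER BY minute DESC;
```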

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

Authored by zzstoatzz.io and Claude, committed by GitHub (a27a924c, d6c22c19)

Changed files: +570 -5

docs/backend/database/connection-pooling.md (new file, +296)

# database connection pooling

configuration and best practices for managing database connections in production.

## overview

plyr.fm uses SQLAlchemy's connection pooling to manage PostgreSQL connections efficiently. proper pool configuration is critical for:

- preventing connection exhaustion under load
- failing fast when the database is slow or unresponsive
- optimizing resource usage across concurrent requests

## configuration

all settings are configurable via environment variables and defined in `src/backend/config.py`.

### timeouts

```bash
# how long a single SQL query can run before being killed (default: 10s)
DATABASE_STATEMENT_TIMEOUT=10.0

# how long to wait when establishing a new database connection (default: 3s)
DATABASE_CONNECTION_TIMEOUT=3.0

# how long to wait for an available connection from the pool (default: = connection_timeout)
# this is automatically set to match DATABASE_CONNECTION_TIMEOUT
```

**why these matter:**

- **statement_timeout**: prevents runaway queries from holding connections indefinitely. set based on your slowest expected query.
- **connection_timeout**: fails fast when the database is slow or unreachable. prevents hanging indefinitely on connection attempts.
- **pool_timeout**: fails fast when all connections are busy. without this, requests wait forever when the pool is exhausted.

### connection pool sizing

```bash
# number of persistent connections to maintain (default: 5)
DATABASE_POOL_SIZE=5

# additional connections to create on demand when pool is exhausted (default: 0)
DATABASE_MAX_OVERFLOW=0

# how long before recycling a connection, in seconds (default: 7200 = 2 hours)
DATABASE_POOL_RECYCLE=7200

# verify connection health before using from pool (default: true)
DATABASE_POOL_PRE_PING=true
```

**sizing considerations:**

total max connections = `pool_size` + `max_overflow`

**pool_size:**
- too small: connection contention, requests wait for available connections
- too large: wastes memory and database resources
- rule of thumb: start with 5, increase if seeing pool exhaustion

**max_overflow:**
- `0` (default): strict limit, fails fast when pool is full
- `> 0`: creates additional connections on demand, provides burst capacity
- tradeoff: graceful degradation vs predictable resource usage

**pool_recycle:**
- prevents stale connections from lingering
- should be less than your database's connection timeout
- 2 hours is a safe default for most PostgreSQL configurations

**pool_pre_ping:**
- adds small overhead (SELECT 1) before each connection use
- prevents using connections that were closed by the database
- recommended for production to avoid connection errors

## production best practices

### small-scale (current deployment)

for a single-instance deployment with moderate traffic:

```bash
# strict pool, fail fast
DATABASE_POOL_SIZE=5
DATABASE_MAX_OVERFLOW=0

# conservative timeouts
DATABASE_STATEMENT_TIMEOUT=10.0
DATABASE_CONNECTION_TIMEOUT=3.0

# standard recycle
DATABASE_POOL_RECYCLE=7200
```

this configuration:
- keeps resource usage predictable
- fails fast under database issues
- prevents cascading failures

### scaling up

if experiencing pool exhaustion (503 errors, connection timeouts):

**option 1: increase pool size**
```bash
DATABASE_POOL_SIZE=10
DATABASE_MAX_OVERFLOW=0
```
pros: more concurrent capacity, still predictable
cons: more memory/database connections

**option 2: add overflow**
```bash
DATABASE_POOL_SIZE=5
DATABASE_MAX_OVERFLOW=5  # allows 10 total under burst load
```
pros: handles traffic spikes, efficient baseline
cons: less predictable resource usage

### tuning statement timeout

adjust based on your query patterns:

```bash
# strict timeout for API workloads
DATABASE_STATEMENT_TIMEOUT=2.0

# lenient for long-running operations (uploads, processing)
DATABASE_STATEMENT_TIMEOUT=30.0
```

analyze slow query logs to understand p99 query latency, then set timeout with appropriate headroom.

## monitoring

### what to watch

1. **pool utilization**: `engine.pool.checkedout()` / `pool_size`
   - alert if > 80% for sustained periods
   - indicates need to increase pool_size

2. **connection timeouts**: rate of pool timeout errors
   - indicates pool exhaustion
   - consider increasing pool_size or max_overflow

3. **statement timeouts**: rate of query timeout errors
   - indicates slow queries or inappropriate timeout
   - investigate slow queries, consider adjusting timeout

4. **connection errors**: failed connection attempts
   - database availability issues
   - network problems

### observability

the engine is instrumented with logfire when observability is enabled (`LOGFIRE_ENABLED=true`). this provides:

- connection pool metrics
- query execution times
- timeout events
- connection errors

review logfire traces to understand:
- which queries are slow
- when pool exhaustion occurs
- connection patterns under load

## how it works

### connection lifecycle

1. **request arrives** → acquire connection from pool
2. **execute query** → use connection (with statement_timeout)
3. **request complete** → return connection to pool
4. **recycle threshold** → close and replace old connections

### failure modes

**pool exhausted:**
- all `pool_size` + `max_overflow` connections in use
- new requests wait for `pool_timeout` seconds
- timeout → 503 error (fail fast)

**database slow:**
- connection attempt exceeds `connection_timeout`
- timeout → fail fast, can retry

**query too slow:**
- query exceeds `statement_timeout`
- query killed, connection returned to pool
- prevents one slow query from blocking others

**connection died:**
- `pool_pre_ping` detects dead connection
- connection discarded, new one created
- prevents errors from using stale connections

## troubleshooting

### 503 errors (pool exhausted)

**symptoms:**
- `QueuePool limit of size N overflow M reached`
- requests timing out waiting for connections

**diagnosis:**
```python
# check pool status
engine = get_engine()
print(f"checked out: {engine.pool.checkedout()}")
print(f"pool size: {engine.pool.size()}")
```

**solutions:**
1. increase `DATABASE_POOL_SIZE`
2. add `DATABASE_MAX_OVERFLOW` for burst capacity
3. investigate slow queries holding connections
4. check for connection leaks (connections not being returned)

### connection timeouts

**symptoms:**
- `asyncpg.exceptions.ConnectionTimeoutError`
- requests failing to connect to database

**diagnosis:**
- check database availability
- check network latency
- review logfire connection traces

**solutions:**
1. verify database is responsive
2. increase `DATABASE_CONNECTION_TIMEOUT` if network latency is high
3. investigate database performance (CPU, I/O)

### statement timeouts

**symptoms:**
- `asyncpg.exceptions.QueryCanceledError`
- specific queries timing out

**diagnosis:**
- identify slow queries in logs
- check query execution plans
- review database indexes

**solutions:**
1. optimize slow queries
2. add database indexes
3. increase `DATABASE_STATEMENT_TIMEOUT` if queries are legitimately slow
4. consider background processing for long operations

## implementation details

### asyncpg-specific configuration

when using `postgresql+asyncpg://` URLs, additional connection settings are applied:

```python
connect_args = {
    # unique prepared statement names per connection
    "prepared_statement_name_func": lambda: f"__asyncpg_{uuid.uuid4()}__",

    # disable statement caching (prevents prepared statement conflicts)
    "statement_cache_size": 0,
    "prepared_statement_cache_size": 0,

    # statement timeout
    "command_timeout": DATABASE_STATEMENT_TIMEOUT,

    # connection timeout
    "timeout": DATABASE_CONNECTION_TIMEOUT,
}
```

these settings prevent prepared statement caching issues in asyncpg while maintaining performance.

### LIFO pool strategy

connections are returned in LIFO (last-in, first-out) order:

```python
pool_use_lifo=True
```

benefits:
- recently used connections are more likely to still be valid
- helps shed excess connections after load spikes
- increases likelihood of connection cache hits

## references

- implementation: `src/backend/utilities/database.py`
- settings: `src/backend/config.py`
- sqlalchemy pooling: https://docs.sqlalchemy.org/en/20/core/pooling.html
- asyncpg connections: https://magicstack.github.io/asyncpg/current/api/index.html#connection

docs/tools/logfire-database-dashboard.sql (new file, +208)

```sql
-- Database Performance Dashboard
-- Shows aggregated database query metrics grouped by query pattern

WITH query_classification AS (
    -- Classify queries by type and extract table names from actual SQL
    SELECT
        span_name,
        attributes->>'db.statement' as query_text,
        duration,
        start_timestamp,
        trace_id,
        otel_status_code,
        CASE
            WHEN span_name LIKE 'SELECT%' THEN 'SELECT'
            WHEN span_name LIKE 'INSERT%' THEN 'INSERT'
            WHEN span_name LIKE 'UPDATE%' THEN 'UPDATE'
            WHEN span_name LIKE 'DELETE%' THEN 'DELETE'
            WHEN span_name LIKE 'BEGIN%' THEN 'TRANSACTION'
            WHEN span_name LIKE 'COMMIT%' THEN 'TRANSACTION'
            WHEN span_name LIKE 'ROLLBACK%' THEN 'TRANSACTION'
            ELSE 'OTHER'
        END AS query_type,
        -- Extract primary table from actual SQL query in attributes
        CASE
            -- tracks queries (most common)
            WHEN attributes->>'db.statement' LIKE '%FROM tracks%' AND attributes->>'db.statement' NOT LIKE '%JOIN%' THEN 'tracks'
            WHEN attributes->>'db.statement' LIKE '%FROM tracks JOIN%' THEN 'tracks+joins'
            WHEN attributes->>'db.statement' LIKE '%INTO tracks%' THEN 'tracks'
            WHEN attributes->>'db.statement' LIKE '%UPDATE tracks%' THEN 'tracks'

            -- track_likes queries
            WHEN attributes->>'db.statement' LIKE '%FROM track_likes%' THEN 'track_likes'
            WHEN attributes->>'db.statement' LIKE '%INTO track_likes%' THEN 'track_likes'

            -- user auth/session queries
            WHEN attributes->>'db.statement' LIKE '%FROM user_sessions%' THEN 'user_sessions'
            WHEN attributes->>'db.statement' LIKE '%INTO user_sessions%' THEN 'user_sessions'
            WHEN attributes->>'db.statement' LIKE '%UPDATE user_sessions%' THEN 'user_sessions'

            -- user preferences
            WHEN attributes->>'db.statement' LIKE '%FROM user_preferences%' THEN 'user_preferences'
            WHEN attributes->>'db.statement' LIKE '%INTO user_preferences%' THEN 'user_preferences'
            WHEN attributes->>'db.statement' LIKE '%UPDATE user_preferences%' THEN 'user_preferences'

            -- queue state
            WHEN attributes->>'db.statement' LIKE '%FROM queue_state%' THEN 'queue_state'
            WHEN attributes->>'db.statement' LIKE '%INTO queue_state%' THEN 'queue_state'
            WHEN attributes->>'db.statement' LIKE '%UPDATE queue_state%' THEN 'queue_state'

            -- artists
            WHEN attributes->>'db.statement' LIKE '%FROM artists%' THEN 'artists'
            WHEN attributes->>'db.statement' LIKE '%INTO artists%' THEN 'artists'
            WHEN attributes->>'db.statement' LIKE '%UPDATE artists%' THEN 'artists'

            -- albums
            WHEN attributes->>'db.statement' LIKE '%FROM albums%' THEN 'albums'
            WHEN attributes->>'db.statement' LIKE '%INTO albums%' THEN 'albums'

            ELSE 'other/unknown'
        END AS primary_table
    FROM records
    WHERE
        -- Filter for database operations
        (span_name = 'SELECT neondb'
            OR span_name = 'INSERT neondb'
            OR span_name = 'UPDATE neondb'
            OR span_name = 'DELETE neondb'
            OR span_name = 'BEGIN neondb'
            OR span_name = 'COMMIT neondb'
            OR span_name = 'ROLLBACK neondb')
        -- Exclude very fast queries (connection pool pings, etc.)
        AND duration > 0.001
),
aggregated_metrics AS (
    -- Aggregate by query type and table
    SELECT
        query_type,
        primary_table,
        COUNT(*) AS total_queries,
        ROUND(AVG(duration * 1000)::numeric, 2) AS avg_duration_ms,
        ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration * 1000)::numeric, 2) AS p50_duration_ms,
        ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration * 1000)::numeric, 2) AS p95_duration_ms,
        ROUND(PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration * 1000)::numeric, 2) AS p99_duration_ms,
        ROUND(MAX(duration * 1000)::numeric, 2) AS max_duration_ms,
        COUNT(*) FILTER (WHERE otel_status_code = 'ERROR') AS error_count,
        ROUND((COUNT(*) FILTER (WHERE otel_status_code = 'ERROR')::numeric / COUNT(*)::numeric * 100), 2) AS error_rate_pct
    FROM query_classification
    GROUP BY query_type, primary_table
)
SELECT
    query_type AS "Query Type",
    primary_table AS "Table",
    total_queries AS "Count",
    avg_duration_ms AS "Avg (ms)",
    p50_duration_ms AS "p50 (ms)",
    p95_duration_ms AS "p95 (ms)",
    p99_duration_ms AS "p99 (ms)",
    max_duration_ms AS "Max (ms)",
    error_count AS "Errors",
    COALESCE(error_rate_pct, 0) AS "Error Rate %"
FROM aggregated_metrics
WHERE total_queries > 0
ORDER BY
    -- Sort by most impactful queries first
    (total_queries * avg_duration_ms) DESC,
    error_count DESC
LIMIT 50;


-- Alternative: Top Slowest Individual Queries
-- Uncomment to see the actual slowest query instances instead of aggregated patterns
/*
SELECT
    ROUND(duration * 1000, 2) AS "Duration (ms)",
    CASE
        WHEN span_name = 'SELECT neondb' THEN 'SELECT'
        WHEN span_name = 'INSERT neondb' THEN 'INSERT'
        WHEN span_name = 'UPDATE neondb' THEN 'UPDATE'
        WHEN span_name = 'DELETE neondb' THEN 'DELETE'
        ELSE 'OTHER'
    END AS "Type",
    LEFT(attributes->>'db.statement', 150) AS "Query Preview",
    start_timestamp AS "Timestamp",
    trace_id AS "Trace ID",
    otel_status_code AS "Status"
FROM records
WHERE
    (span_name = 'SELECT neondb'
        OR span_name = 'INSERT neondb'
        OR span_name = 'UPDATE neondb'
        OR span_name = 'DELETE neondb')
    AND duration > 0.001
ORDER BY duration DESC
LIMIT 25;
*/


-- Alternative: Database Operations Timeline (per-minute aggregation)
-- Uncomment to see query volume and performance over time
/*
WITH time_metrics AS (
    SELECT
        DATE_TRUNC('minute', start_timestamp) AS minute,
        CASE
            WHEN span_name = 'SELECT neondb' THEN 'SELECT'
            WHEN span_name = 'INSERT neondb' THEN 'INSERT'
            WHEN span_name = 'UPDATE neondb' THEN 'UPDATE'
            WHEN span_name = 'DELETE neondb' THEN 'DELETE'
            ELSE 'OTHER'
        END AS query_type,
        COUNT(*) AS query_count,
        ROUND(AVG(duration * 1000)::numeric, 2) AS avg_duration_ms,
        COUNT(*) FILTER (WHERE otel_status_code = 'ERROR') AS error_count
    FROM records
    WHERE
        (span_name = 'SELECT neondb'
            OR span_name = 'INSERT neondb'
            OR span_name = 'UPDATE neondb'
            OR span_name = 'DELETE neondb')
        AND start_timestamp > NOW() - INTERVAL '2 hours'
    GROUP BY minute, query_type
)
SELECT
    minute AS "Time",
    query_type AS "Type",
    query_count AS "Count",
    avg_duration_ms AS "Avg Duration (ms)",
    error_count AS "Errors"
FROM time_metrics
ORDER BY minute DESC, query_count DESC;
*/


-- Alternative: Connection Pool and Transaction Metrics
-- Uncomment to analyze transaction patterns and connection behavior
/*
WITH transaction_metrics AS (
    SELECT
        DATE_TRUNC('minute', start_timestamp) AS minute,
        COUNT(*) FILTER (WHERE span_name = 'BEGIN neondb') AS begin_count,
        COUNT(*) FILTER (WHERE span_name = 'COMMIT neondb') AS commit_count,
        COUNT(*) FILTER (WHERE span_name = 'ROLLBACK neondb') AS rollback_count,
        ROUND(AVG(duration * 1000) FILTER (WHERE span_name = 'BEGIN neondb')::numeric, 2) AS avg_begin_ms,
        ROUND(AVG(duration * 1000) FILTER (WHERE span_name = 'COMMIT neondb')::numeric, 2) AS avg_commit_ms
    FROM records
    WHERE
        span_name = 'BEGIN neondb'
        OR span_name = 'COMMIT neondb'
        OR span_name = 'ROLLBACK neondb'
    GROUP BY minute
)
SELECT
    minute AS "Minute",
    begin_count AS "Begins",
    commit_count AS "Commits",
    rollback_count AS "Rollbacks",
    avg_begin_ms AS "Avg BEGIN (ms)",
    avg_commit_ms AS "Avg COMMIT (ms)",
    CASE
        WHEN begin_count > 0
        THEN ROUND((rollback_count::numeric / begin_count::numeric * 100), 2)
        ELSE 0
    END AS "Rollback Rate %"
FROM transaction_metrics
WHERE begin_count > 0 OR commit_count > 0 OR rollback_count > 0
ORDER BY minute DESC
LIMIT 60;
*/
```

docs/tools/logfire.md (+15)

```diff
 3. Add retry logic for initial database connections
 4. Consider connection pool size tuning

+## Pre-built Dashboards
+
+### Database Performance Dashboard
+
+comprehensive database query analysis showing:
+- query performance by type and table
+- latency percentiles (p50, p95, p99)
+- error rates and counts
+- query volume and impact
+
+see `logfire-database-dashboard.sql` for the full query with alternative views for:
+- slowest individual queries
+- per-minute query volume timeline
+- transaction and connection pool metrics
+
 ## Resources

 - [Logfire SQL Explorer Documentation](https://logfire.pydantic.dev/docs/guides/web-ui/explore/)
```

src/backend/config.py (+34)

```diff
         validation_alias="DATABASE_URL",
         description="PostgreSQL connection string",
     )
+
+    # timeouts
+    statement_timeout: float = Field(
+        default=10.0,
+        validation_alias="DATABASE_STATEMENT_TIMEOUT",
+        description="Timeout in seconds for SQL statement execution. Prevents runaway queries from holding connections indefinitely.",
+    )
+    connection_timeout: float = Field(
+        default=3.0,
+        validation_alias="DATABASE_CONNECTION_TIMEOUT",
+        description="Timeout in seconds for establishing database connections. Fails fast when database is slow or unresponsive.",
+    )
     queue_connect_timeout: float = Field(
         default=3.0,
         validation_alias="QUEUE_CONNECT_TIMEOUT",
         description="Timeout in seconds for queue listener database connections",
     )
+
+    # connection pool settings
+    pool_size: int = Field(
+        default=5,
+        validation_alias="DATABASE_POOL_SIZE",
+        description="Number of database connections to keep in the pool at all times.",
+    )
+    pool_max_overflow: int = Field(
+        default=0,
+        validation_alias="DATABASE_MAX_OVERFLOW",
+        description="Maximum connections to create beyond pool_size when pool is exhausted. Total max connections = pool_size + pool_max_overflow.",
+    )
+    pool_recycle: int = Field(
+        default=7200,
+        validation_alias="DATABASE_POOL_RECYCLE",
+        description="Seconds before recycling a connection. Prevents stale connections from lingering. Default 2 hours.",
+    )
+    pool_pre_ping: bool = Field(
+        default=True,
+        validation_alias="DATABASE_POOL_PRE_PING",
+        description="Verify connection health before using from pool. Adds small overhead but prevents using dead connections.",
+    )
```

src/backend/utilities/database.py (+17 -5)
··· 39 "prepared_statement_cache_size": 0, 40 } 41 42 engine = create_async_engine( 43 settings.database.url, 44 echo=settings.app.debug, 45 - pool_pre_ping=True, # verify connections before use 46 - pool_recycle=3600, # recycle connections after 1 hour 47 - pool_use_lifo=True, # reuse recent connections 48 - pool_size=5, 49 - max_overflow=0, 50 **kwargs, 51 ) 52
··· 39 "prepared_statement_cache_size": 0, 40 } 41 42 + # apply statement timeout (command_timeout in asyncpg) 43 + if settings.database.statement_timeout: 44 + kwargs["connect_args"]["command_timeout"] = ( 45 + settings.database.statement_timeout 46 + ) 47 + 48 + # apply connection timeout 49 + if settings.database.connection_timeout: 50 + kwargs["connect_args"]["timeout"] = settings.database.connection_timeout 51 + # also set pool timeout to fail fast when pool is exhausted 52 + kwargs["pool_timeout"] = settings.database.connection_timeout 53 + 54 engine = create_async_engine( 55 settings.database.url, 56 echo=settings.app.debug, 57 + pool_pre_ping=settings.database.pool_pre_ping, 58 + pool_recycle=settings.database.pool_recycle, 59 + pool_use_lifo=True, # reuse recent connections (LIFO order) 60 + pool_size=settings.database.pool_size, 61 + max_overflow=settings.database.pool_max_overflow, 62 **kwargs, 63 ) 64