hotfix: add configurable timeout to queue listener database connection (#280)

CRITICAL FIX for production outage at 2025-11-18 05:04 UTC.

## root cause
asyncpg.connect() in the queue listener had no timeout. when the Neon database
was slow or unresponsive, the connection attempt would hang indefinitely,
exhausting SQLAlchemy's 5-connection pool and causing all API requests
to time out with 500 errors.

## analysis
queried logfire for healthy connection latencies:
- typical: 2-4ms
- p50: 25ms
- p95: 352ms
- p99: 549ms

## fix
- add QUEUE_CONNECT_TIMEOUT setting (default: 3.0s)
- use asyncio.wait_for() with configurable timeout on asyncpg.connect()
- default gives roughly 5.5x headroom over p99 (3.0s vs 549ms) while failing fast
- add specific TimeoutError handling with clear logging
- connection failures now fail fast instead of hanging
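
the wait_for pattern listed above can be tried without a database. this is a minimal sketch, not the code from this commit: `slow_connect` and `connect_with_timeout` are hypothetical stand-ins for asyncpg.connect() and the listener's connect step, and the timeouts are shortened so the demo finishes quickly.

```python
import asyncio
from typing import Optional


async def slow_connect(delay: float) -> str:
    """hypothetical stand-in for asyncpg.connect(); sleeps to mimic a slow database."""
    await asyncio.sleep(delay)
    return "connection"


async def connect_with_timeout(delay: float, timeout: float) -> Optional[str]:
    """return the connection, or None if the attempt exceeds `timeout` seconds."""
    try:
        # wait_for cancels the inner coroutine once the timeout elapses
        return await asyncio.wait_for(slow_connect(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # alias of the builtin TimeoutError on Python 3.11+
        return None


fast = asyncio.run(connect_with_timeout(delay=0.01, timeout=0.5))
slow = asyncio.run(connect_with_timeout(delay=5.0, timeout=0.05))
```

note that before Python 3.11, asyncio.wait_for() raises asyncio.TimeoutError, which is a different class from the builtin TimeoutError. on such interpreters a bare `except TimeoutError` would likely be skipped and the generic `except Exception` branch would handle the failure instead.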

## configuration
adjust timeout via environment variable if needed:
```
QUEUE_CONNECT_TIMEOUT=5.0 # increase if seeing timeout errors
```
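
for illustration, the override behaviour can be approximated with the standard library alone. the commit itself reads the value through a settings Field with a validation alias; `read_queue_connect_timeout` below is a hypothetical helper, and its positive-value check is an added assumption rather than something the commit enforces.

```python
from typing import Mapping


def read_queue_connect_timeout(env: Mapping[str, str], default: float = 3.0) -> float:
    """hypothetical helper: parse QUEUE_CONNECT_TIMEOUT, falling back to the default."""
    raw = env.get("QUEUE_CONNECT_TIMEOUT")
    if raw is None:
        return default
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        # assumption: a zero or negative timeout is treated as a config error
        raise ValueError("QUEUE_CONNECT_TIMEOUT must be positive")
    return value


default_timeout = read_queue_connect_timeout({})
override_timeout = read_queue_connect_timeout({"QUEUE_CONNECT_TIMEOUT": "5.0"})
```

passing `os.environ` as `env` would read the value from the real process environment.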

## impact
prevents queue listener from blocking all database connections when
Neon is slow. the listener will retry after reconnect_delay (5s)
instead of hanging indefinitely and breaking the entire app.
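
a rough sketch of the retry behaviour described above, under stated assumptions: `listen_with_retry` and `flaky_connect` are hypothetical names, the delays are shortened for the demo, and `max_attempts` exists only to keep the example bounded (this commit does not show how the listener's retry loop is actually written).

```python
import asyncio

attempts = []


async def flaky_connect() -> str:
    """hypothetical connect: hangs on the first call, succeeds on the second."""
    attempts.append(1)
    if len(attempts) == 1:
        await asyncio.sleep(10)  # simulates an unresponsive database
    return "connection"


async def listen_with_retry(connect, timeout, reconnect_delay, max_attempts):
    """try `connect` up to max_attempts times, sleeping reconnect_delay between timeouts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(connect(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            await asyncio.sleep(reconnect_delay)


result = asyncio.run(
    listen_with_retry(flaky_connect, timeout=0.05, reconnect_delay=0.01, max_attempts=3)
)
```

the first attempt is abandoned after the timeout instead of holding the caller for the full hang, and the second attempt succeeds after the delay.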

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>

authored by zzstoatzz.io Claude and committed by GitHub d6c22c19 9f632cfc

Changed files
+14 -1
+9 -1
src/backend/_internal/queue.py
```
···
         db_url = db_url.replace("postgresql+psycopg://", "postgresql://")

         try:
-            self.conn = await asyncpg.connect(db_url)
+            # add timeout to prevent hanging indefinitely on slow/unresponsive database
+            # default 3s timeout provides 5x headroom over p99 latency (~550ms) while failing fast
+            timeout = settings.database.queue_connect_timeout
+            self.conn = await asyncio.wait_for(asyncpg.connect(db_url), timeout=timeout)
             if not self.conn:
                 raise Exception("failed to connect to database")
             await self.conn.add_listener("queue_changes", self._handle_notification)
             logger.info("queue service connected to database and listening")
+        except TimeoutError:
+            logger.error(
+                f"database connection timed out after {settings.database.queue_connect_timeout}s for queue listening"
+            )
+            raise
         except Exception:
             logger.exception("failed to connect to database for queue listening")
             raise
```
+5
src/backend/config.py
```
···
         validation_alias="DATABASE_URL",
         description="PostgreSQL connection string",
     )
+    queue_connect_timeout: float = Field(
+        default=3.0,
+        validation_alias="QUEUE_CONNECT_TIMEOUT",
+        description="Timeout in seconds for queue listener database connections",
+    )


 class StorageSettings(RelaySettingsSection):
```