runbooks
operational procedures for production incidents.
available runbooks
- connection-pool-exhaustion - 500s everywhere, queue listener down, stuck connections
when to use
runbooks are for known failure modes with established remediation steps. if you encounter a new type of incident:
- stabilize first (restart machines if needed)
- investigate using logfire
- document the incident and create a new runbook
general troubleshooting
# check machine status
fly status -a relay-api
# view recent logs
fly logs -a relay-api
# restart machines
fly machines list -a relay-api
fly machines restart <machine-id> -a relay-api
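when several machines need a restart, the restart command above can be wrapped in a small loop. this is a minimal sketch, not an established procedure: the `restart_machines` helper and the `FLY` / `APP` variables are hypothetical names introduced here, and the machine ids are expected to be copied by hand from `fly machines list`.

```shell
#!/usr/bin/env bash
# sketch only: restart_machines, FLY and APP are illustrative names, not part of flyctl.
# FLY can be overridden (for example FLY=echo) to preview commands without running them.
FLY="${FLY:-fly}"
APP="${APP:-relay-api}"

restart_machines() {
  # restart each machine id passed as an argument, one at a time,
  # using the same command shown in the manual steps
  local id
  for id in "$@"; do
    "$FLY" machines restart "$id" -a "$APP"
  done
}
```

after sourcing the snippet, call `restart_machines <machine-id> <machine-id>` with ids taken from the list output. setting `FLY=echo` first prints the commands instead of running them, which is a cheap way to double-check the ids during an incident.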