Incident Response Runbook
General incident detection, triage, and resolution playbook for Cruvero production environments. Covers Temporal, LLM provider, Postgres, and runaway workflow scenarios.
Trigger Conditions
Open an incident when any of the following occur:
- Health checks return
unhealthyfor critical components. - Error rate spikes above normal baseline.
- Workflow backlog and latency rise rapidly.
- Cost or token usage anomalies indicate runaway behavior.
Detection and Triage
- Confirm alert and incident scope (tenants, regions, components).
- Pull health snapshots:
curl -fsS http://<ui>/api/healthcurl -fsS http://<ui>/api/health/detail
- Classify primary failing component:
- Temporal
- LLM provider/failover
- Postgres
- Runaway workflow/agent
- Freeze high-risk operations if needed (bulk writes/imports).
Component Procedures
Temporal Down
- Check Temporal frontend/matching/history status.
- Confirm network path from worker namespace to Temporal.
- Restart worker pods only after Temporal health is restored.
- Verify workers reconnect and queue depth begins to drain.
Verification:
readyzon workers succeeds.- Workflow schedule-to-start latency returns toward SLO.
LLM Provider Down
- Confirm provider errors (429/503/timeouts) from logs/metrics.
- Verify failover chain switched providers.
- Check failover audit/metric events for expected provider transition.
- If no failover occurred, temporarily force provider switch via config and restart workers.
Verification:
- New requests succeed via secondary provider.
/api/health/detailshows degraded primary, healthy active provider.
Postgres Down
- Check primary DB health and replication state.
- Trigger managed failover or Patroni promotion procedure.
- Update connection endpoint if required.
- Restart affected workloads after DB write availability recovers.
Verification:
SELECT 1and critical table read/write checks pass.- No sustained increase in DB connection failures.
Runaway Agent / Workflow
- Identify high-cost or looping workflow IDs.
- Stop offending run(s):
go run ./cmd/control --workflow-id <id> --action pause- or terminate via Temporal CLI/UI policy.
- Inspect quota usage and audit trail for runaway cause.
- Apply tenant policy constraints (tool/model/quotas) before resuming traffic.
Verification:
- Cost growth normalizes.
- Queue depth and token usage stabilize.
Post-Incident Actions
- Export relevant audit events for incident window.
- Complete timeline and root cause analysis (RCA).
- Define preventive controls (alerts, guardrails, config defaults).
- Track follow-up items with owners and due dates.
Escalation Path
- Primary on-call -> Incident Commander -> Platform/SRE.
- Add Security lead for data exposure/credential concerns.
- Add Database owner for failover/PITR decisions.
- Add Product/Application owner for policy/workflow regressions.