Incident Response Runbook

General incident detection, triage, and resolution playbook for Cruvero production environments. Covers Temporal, LLM provider, Postgres, and runaway workflow scenarios.

Trigger Conditions

Open an incident when any of the following occur:

Health checks return unhealthy for critical components.
Error rate spikes above normal baseline.
Workflow backlog and latency rise rapidly.
Cost or token usage anomalies indicate runaway behavior.
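The error-rate trigger above can be made concrete with a small comparison helper. This is only a sketch: the function name error_rate_spiked and the default multiplier of 3 are assumptions for illustration, not values defined by this runbook, and the current and baseline rates must come from your own metrics.

```shell
# Hypothetical helper (not part of the Cruvero tooling).
# Usage: error_rate_spiked CURRENT BASELINE [MULTIPLIER]
# Succeeds (exit 0) when CURRENT is greater than BASELINE * MULTIPLIER.
# MULTIPLIER defaults to 3, an assumed value; tune it to your alerting policy.
error_rate_spiked() {
  awk -v cur="$1" -v base="$2" -v k="${3:-3}" \
    'BEGIN { exit !(cur > base * k) }'
}

# Example call:
#   error_rate_spiked 0.12 0.02 && echo "error rate is above the spike threshold"
```

Rates are passed as plain decimals (0.12 for 12%); awk handles the floating-point comparison, which plain sh arithmetic cannot.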

Detection and Triage

Confirm the alert and the incident scope (tenants, regions, components).
Pull health snapshots:
- curl -fsS http://<ui>/api/health
- curl -fsS http://<ui>/api/health/detail
Classify primary failing component:
- Temporal
- LLM provider/failover
- Postgres
- Runaway workflow/agent
If needed, freeze high-risk operations such as bulk writes and imports.
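During triage it can help to keep the health responses for the later timeline. The sketch below wraps the two curl calls from the steps above and writes each response to a timestamped file. The function name snapshot_health, the output directory, the 10-second timeout, and the .json file extension are assumptions; only the two endpoint paths come from this runbook.

```shell
# Hypothetical helper (not part of the Cruvero tooling).
# Usage: snapshot_health BASE_URL OUT_DIR
# Saves /api/health and /api/health/detail to OUT_DIR with a UTC timestamp.
# Returns non-zero if either endpoint is unreachable or returns an HTTP error.
snapshot_health() {
  local base_url="$1" out_dir="$2" ts path name rc=0
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  mkdir -p "$out_dir" || return 1
  for path in /api/health /api/health/detail; do
    # Build a file name from the endpoint, e.g. api_health_detail-<ts>.json
    name=$(printf '%s' "${path#/}" | tr '/' '_')
    if ! curl -fsS --max-time 10 "$base_url$path" > "$out_dir/$name-$ts.json"; then
      echo "unhealthy or unreachable: $path" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example call (replace <ui> with the real host):
#   snapshot_health "http://<ui>" ./incident-snapshots
```

Because curl runs with -f, an HTTP error status counts as a failure, so the return code doubles as a quick unhealthy signal.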

Component Procedures

Temporal Down

Check Temporal frontend/matching/history status.
Confirm network path from worker namespace to Temporal.
Restart worker pods only after Temporal health is restored.
Verify workers reconnect and queue depth begins to drain.

Verification:

The readyz check on each worker succeeds.
Workflow schedule-to-start latency returns toward SLO.
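The worker readiness check can be polled rather than retried by hand after a restart. The sketch below assumes each worker exposes its readyz check over HTTP at a URL you supply; the function name wait_ready, the retry count, and the delay are illustrative defaults, not documented settings.

```shell
# Hypothetical helper (not part of the Cruvero tooling).
# Usage: wait_ready URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it returns a successful HTTP status or attempts run out.
# Defaults (30 attempts, 10 seconds apart) are assumed values.
wait_ready() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "ready after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}

# Example call (the worker host and path are placeholders):
#   wait_ready "http://<worker>/readyz" 30 10
```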

LLM Provider Down

Confirm provider errors (429/503/timeouts) from logs/metrics.
Verify failover chain switched providers.
Check failover audit/metric events for expected provider transition.
If no failover occurred, temporarily force provider switch via config and restart workers.

Verification:

New requests succeed via secondary provider.
/api/health/detail shows the primary provider as degraded and the active provider as healthy.
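To confirm provider errors quickly, a log stream can be summarized by error class. The sketch below assumes log lines carry a status=NNN field and the word timeout for timed-out calls. That log format and the function name count_provider_errors are assumptions, so adjust the patterns to whatever your workers actually emit.

```shell
# Hypothetical helper (not part of the Cruvero tooling).
# Usage: <log source> | count_provider_errors
# Counts lines matching status=429, status=503, and "timeout" on stdin.
# The status=NNN log field is an assumed format.
count_provider_errors() {
  awk '
    /status=429/ { c429++ }
    /status=503/ { c503++ }
    /timeout/    { timeouts++ }
    END { printf "429=%d 503=%d timeout=%d\n", c429, c503, timeouts }
  '
}

# Example call (the log source is a placeholder):
#   kubectl logs <worker-pod> --since=15m | count_provider_errors
```

A high 429 count points toward rate limiting, while 503s and timeouts point toward a provider outage, which helps decide whether forcing a provider switch is warranted.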

Postgres Down

Check primary DB health and replication state.
Trigger the managed failover or the Patroni promotion procedure.
Update connection endpoint if required.
Restart affected workloads after DB write availability recovers.

Verification:

SELECT 1 succeeds, and read/write checks against critical tables pass.
No sustained increase in DB connection failures.
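Part of the verification above can be scripted with psql. The sketch below runs SELECT 1 and then checks pg_is_in_recovery(), which returns false on a server that accepts writes. The function name pg_accepts_writes and the connection string are assumptions, and the sketch does not replace the read/write checks against critical tables.

```shell
# Hypothetical helper (not part of the Cruvero tooling).
# Usage: pg_accepts_writes CONNECTION_STRING
# Succeeds when the server answers SELECT 1 and reports it is not in recovery
# (a standby in recovery is read-only).
pg_accepts_writes() {
  local dsn="$1" in_recovery
  # -X skips psqlrc; -A and -t print bare values; -q suppresses notices.
  psql -X -Atq -d "$dsn" -c 'SELECT 1' >/dev/null || return 1
  in_recovery=$(psql -X -Atq -d "$dsn" -c 'SELECT pg_is_in_recovery()') || return 1
  [ "$in_recovery" = "f" ]
}

# Example call (the connection string is a placeholder):
#   pg_accepts_writes "postgres://<user>@<host>:5432/<db>" && echo "primary is writable"
```

Run it against the endpoint the workloads use, so that a stale connection endpoint pointing at a demoted node shows up as a failure.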

Runaway Agent / Workflow

Identify high-cost or looping workflow IDs.
Stop offending run(s):
- go run ./cmd/control --workflow-id <id> --action pause
- or terminate the run via the Temporal CLI or UI, following policy.
Inspect quota usage and audit trail for runaway cause.
Apply tenant policy constraints (tool/model/quotas) before resuming traffic.

Verification:

Cost growth normalizes.
Queue depth and token usage stabilize.
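When several runs are misbehaving at once, the pause command from the steps above can be applied to a list of workflow IDs. The sketch below wraps that command in a loop. The function name pause_workflows is an assumption, and it presumes you run it from the repository root, where ./cmd/control resolves.

```shell
# Hypothetical helper (not part of the Cruvero tooling).
# Usage: pause_workflows WORKFLOW_ID...
# Pauses each workflow with the control command from this runbook and
# returns non-zero if any pause fails, so failures are not silently skipped.
pause_workflows() {
  local id failed=0
  for id in "$@"; do
    if go run ./cmd/control --workflow-id "$id" --action pause; then
      echo "paused: $id"
    else
      echo "failed to pause: $id" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example call (IDs are placeholders):
#   pause_workflows <id-1> <id-2>
```

The loop continues past a failed pause so that one bad ID does not leave the remaining runaway runs active.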

Post-Incident Actions

Export the relevant audit events for the incident window.
Complete timeline and root cause analysis (RCA).
Define preventive controls (alerts, guardrails, config defaults).
Track follow-up items with owners and due dates.

Escalation Path

Primary on-call -> Incident Commander -> Platform/SRE.
Add the Security lead for data exposure or credential concerns.
Add the Database owner for failover or PITR decisions.
Add the Product/Application owner for policy or workflow regressions.
