Disaster Recovery Playbook
This runbook defines recovery targets and step-by-step actions for region/database/provider incidents.
RPO / RTO Targets
| Component | RPO Target | RTO Target | Notes |
|---|---|---|---|
| Temporal state/history | <= 5 minutes | <= 30 minutes | Depends on Temporal replication topology |
| Postgres (tenant, registry, quotas, audit) | <= 5 minutes | <= 30 minutes | Requires WAL archival + tested restore |
| Dragonfly cache | <= 1 hour | <= 15 minutes | Treat as rebuildable unless strict persistence is configured |
| Tool registry exports | <= 24 hours | <= 15 minutes | Can be restored with `backup registry-import` |
| Audit archive objects | <= 24 hours | <= 1 hour | S3/object-store durability assumed |
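The targets in the table can be checked mechanically during a drill. The following is a minimal sketch, assuming you record the relevant moments as Unix epoch seconds; the function names and the example timestamps are hypothetical and not part of the existing tooling.

```shell
# Hypothetical helpers: compare what a drill achieved against the table's targets.
# Timestamps are Unix epoch seconds; targets are durations in seconds.

# rpo_seconds LAST_RECOVERABLE_WRITE INCIDENT_START
# Data-loss window: time between the last write that survived and the incident.
rpo_seconds() {
  echo $(( $2 - $1 ))
}

# rto_seconds INCIDENT_START SERVICE_RESTORED
# Downtime: time between the incident and the service being usable again.
rto_seconds() {
  echo $(( $2 - $1 ))
}

# within_target ACTUAL_SECONDS TARGET_SECONDS
# Succeeds (exit 0) when the measured value meets the "<=" target.
within_target() {
  [ "$1" -le "$2" ]
}

# Example with made-up timestamps: Postgres RPO target is 5 minutes (300 s).
incident_start=1700000000
last_write=1699999800
if within_target "$(rpo_seconds "$last_write" "$incident_start")" 300; then
  echo "Postgres RPO met"
fi
```

Keeping the comparison in one small function makes it easy to reuse the same check for every row of the table with a different target value.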
Incident Types
- Region outage
- Postgres corruption or accidental data loss
- Temporal cluster outage
- LLM provider prolonged outage
Recovery Procedure
- Detect and declare incident.
- Freeze risky mutating operations (tenant writes, registry imports, bulk updates).
- Capture current status snapshots:
  - `/api/health`
  - `/api/health/detail`
  - Temporal cluster and namespace health
- Restore dependencies in order:
  - Postgres
  - Temporal
  - Workers/UI
- Verify tenant isolation, quotas, and audit integrity.
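The restore order matters because each layer depends on the one before it. One way to enforce it is a small driver that stops at the first failed step. This is a sketch only: `run_in_order` and the three step functions are hypothetical placeholders, and the real restore and health-check commands are left for you to fill in.

```shell
# run_in_order STEP...
# Runs each named step (a shell function or command) in the order given and
# stops at the first failure, so Temporal is never started against a missing
# database and workers are never started against an unhealthy Temporal.
run_in_order() {
  for step in "$@"; do
    echo "==> $step"
    if ! "$step"; then
      echo "FAILED: $step; stopping before later steps" >&2
      return 1
    fi
  done
}

# Placeholder steps: replace the bodies with the real restore and health checks.
restore_postgres()   { echo "restore Postgres (placeholder)"; }
restore_temporal()   { echo "restore Temporal (placeholder)"; }
restore_workers_ui() { echo "restart workers and UI (placeholder)"; }

run_in_order restore_postgres restore_temporal restore_workers_ui
```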
Postgres PITR Recovery
- Provision replacement Postgres cluster.
- Restore latest base backup.
- Replay WAL to target timestamp.
- Run `cmd/migrate --cmd up`.
- Validate row counts for critical tables:
  - `tenants`
  - `tool_registries`
  - `audit_events`
  - `tenant_usage_daily`
- Switch application DSN and restart workloads.
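The row-count validation can be scripted so it is repeatable across drills. The sketch below assumes `psql` is installed and that you keep a baseline file of `table count` lines recorded at or before the recovery target timestamp; `NEW_DSN` and `baseline-counts.txt` are hypothetical names. A lower count than the baseline is flagged for review, since tables that allow deletes can shrink legitimately.

```shell
# Tables named in the validation step of this runbook.
CRITICAL_TABLES="tenants tool_registries audit_events tenant_usage_daily"

# count_rows DSN
# Prints "table count" for each critical table on the restored cluster.
# -A (unaligned) and -t (tuples only) make psql print just the number.
count_rows() {
  for table in $CRITICAL_TABLES; do
    n=$(psql -d "$1" -At -c "SELECT count(*) FROM $table") || return 1
    echo "$table $n"
  done
}

# check_counts BASELINE_FILE
# Reads "table count" lines on stdin (the restored cluster) and compares them
# with the baseline file. Fails if a baseline table is missing from the input
# or has fewer rows than the baseline.
check_counts() {
  awk -v baseline="$1" '
    BEGIN {
      while ((getline line < baseline) > 0) { split(line, f, " "); want[f[1]] = f[2] }
    }
    { got[$1] = $2 }
    END {
      bad = 0
      for (t in want) {
        if (!(t in got))                   { print "MISSING " t; bad = 1 }
        else if (got[t] + 0 < want[t] + 0) { print "SHORT " t ": " got[t] " < " want[t]; bad = 1 }
      }
      exit bad
    }'
}

# Usage (not run here because it needs a live database):
#   count_rows "$NEW_DSN" | check_counts baseline-counts.txt
```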
Temporal Recovery
- Confirm cluster membership and frontend availability.
- Validate namespace registration and replication status.
- Restart workers once Temporal health is green.
- Reconcile stuck workflows:
  - query running workflows,
  - terminate or retry long-stuck runs by policy.
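For the reconciliation step, it helps to separate "find candidates" from "act on them". The sketch below only finds candidates: it assumes you can export running workflows as `workflow_id start_epoch` lines (for example from a Temporal list query filtered on `ExecutionStatus="Running"`, reshaped by your own tooling). The function name, input format, and age threshold are assumptions; whether to terminate or retry each candidate stays a policy decision.

```shell
# stuck_workflows NOW_EPOCH MAX_AGE_SECONDS
# Reads "workflow_id start_epoch" lines on stdin (running workflows only) and
# prints the IDs that have been running longer than MAX_AGE_SECONDS.
# Blank or malformed lines (fewer than two fields) are skipped.
stuck_workflows() {
  awk -v now="$1" -v max_age="$2" 'NF >= 2 && (now - $2) > max_age { print $1 }'
}

# Example with made-up data: at time 5000, with a one-hour limit,
# wf-a has run for 4000 s and wf-b for 1000 s.
printf 'wf-a 1000\nwf-b 4000\n' | stuck_workflows 5000 3600
# prints: wf-a
```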
Tool Registry Recovery
- Restore registries from backup JSON:
  ```bash
  go run ./cmd/backup registry-import --in backups/tool-registries-*.json
  ```
- Verify hash immutability checks pass.
- Re-seed default/global registry if required.
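If several backup files match the glob, the shell expands it into multiple arguments after `--in`. This runbook does not say whether `registry-import` accepts that, so selecting one file explicitly may be the safer approach. The helper below is a sketch with a hypothetical name, and it assumes the part of the filename matched by `*` is a sortable timestamp so that the last name is the newest backup.

```shell
# latest_registry_backup DIR
# Prints the last file (by name) matching tool-registries-*.json in DIR.
# Assumes the variable part of the name sorts chronologically.
latest_registry_backup() {
  latest=""
  for f in "$1"/tool-registries-*.json; do
    [ -e "$f" ] || continue   # the glob matched nothing
    latest=$f                 # globs expand in sorted order; keep the last one
  done
  if [ -z "$latest" ]; then
    echo "no registry backup found in $1" >&2
    return 1
  fi
  echo "$latest"
}

# Usage (not run here):
#   file=$(latest_registry_backup backups) &&
#     go run ./cmd/backup registry-import --in "$file"
```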
Audit Recovery and Verification
- Restore archive objects from object store if needed.
- If `audit_events` was restored, verify hash chains by tenant using `/api/audit/verify`.
- Confirm no cross-tenant leakage in audit reads.
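To illustrate what a hash-chain check does in general, here is a small standalone sketch. It is not the implementation behind `/api/audit/verify`, whose hashing rule is not described in this runbook. The sketch assumes each event stores the previous event's hash plus its own hash, that a hash is the SHA-256 of `prev_hash|payload`, and that the first event links to the literal string `genesis`. It uses `sha256sum` from GNU coreutils.

```shell
# chain_hash PREV_HASH PAYLOAD
# Assumed rule for this sketch: hash = SHA-256 of "PREV_HASH|PAYLOAD".
chain_hash() {
  printf '%s|%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

# verify_chain
# Reads "prev_hash hash payload..." lines on stdin, oldest event first.
# Fails at the first event whose prev_hash does not match the previous
# event's hash, or whose stored hash does not match its content.
verify_chain() {
  expected_prev="genesis"
  line_no=0
  while read -r prev hash payload; do
    line_no=$((line_no + 1))
    if [ "$prev" != "$expected_prev" ]; then
      echo "event $line_no: broken link to previous event" >&2
      return 1
    fi
    if [ "$(chain_hash "$prev" "$payload")" != "$hash" ]; then
      echo "event $line_no: content does not match stored hash" >&2
      return 1
    fi
    expected_prev=$hash
  done
}
```

Because every hash covers the previous one, changing or removing any restored event breaks verification for that event or the one after it, which is why the check is worth running per tenant after a restore.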
DR Test Script (Quarterly)
Run a controlled simulation in staging:
```bash
#!/usr/bin/env bash
set -euo pipefail

# 1) Scale workers down in primary region
kubectl -n cruvero scale deploy/cruvero-worker --replicas=0

# 2) Validate failover region serves traffic
curl -fsS https://cruvero.example.com/healthz

# 3) Run backup roundtrip test
go run ./cmd/backup dump --out /tmp/dr-test.dump
go run ./cmd/backup restore --in /tmp/dr-test.dump --clean

# 4) Validate core health endpoints
curl -fsS https://cruvero.example.com/api/health
curl -fsS https://cruvero.example.com/api/health/detail
```
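A single `curl` fails immediately if the failover region is still warming up. For a drill it can be more useful to poll until the endpoint recovers and record how long that took, so the result can be compared with the RTO target. The helper below is a sketch; `wait_healthy` is a hypothetical name and the timeout and interval values are assumptions to tune.

```shell
# wait_healthy URL TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls URL until curl succeeds, then prints the seconds that elapsed.
# Fails if the endpoint is still unhealthy once TIMEOUT_SECONDS have passed.
wait_healthy() {
  url=$1
  timeout=$2
  interval=${3:-5}
  start=$(date +%s)
  until curl -fsS -o /dev/null "$url"; do
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "still unhealthy after ${elapsed}s: $url" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo $(( $(date +%s) - start ))
}

# Usage (not run here): wait up to the 30-minute RTO, polling every 10 s.
#   recovered_in=$(wait_healthy https://cruvero.example.com/api/health 1800 10)
#   echo "health endpoint recovered in ${recovered_in}s"
```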
Post-Recovery Checklist
- All critical checks report `healthy` or approved `degraded`.
- Workflow execution and task queue latency back within SLO.
- Quota enforcement and audit logging confirmed working.
- LLM failover chain reports expected provider status.
- Incident notes completed with timeline and RCA owner assigned.
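The first checklist item can be turned into a sign-off gate. The sketch below assumes the health detail output has been reduced to `check_name status` lines; the response format of `/api/health/detail` is not described here, so that reduction, the function name, and the example check names are all hypothetical.

```shell
# evaluate_checks APPROVED_DEGRADED
# Reads "check_name status" lines on stdin. APPROVED_DEGRADED is a
# space-separated list of checks that are allowed to report "degraded".
# Prints every check that blocks sign-off and fails if there is at least one.
evaluate_checks() {
  awk -v approved="$1" '
    BEGIN { n = split(approved, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    $2 == "healthy" { next }
    $2 == "degraded" && ($1 in ok) { next }
    { print "BLOCKING " $1 " (" $2 ")"; bad = 1 }
    END { exit bad + 0 }'
}

# Usage (made-up check names): only "cache" has an approved degraded state.
#   printf 'postgres healthy\ncache degraded\nllm degraded\n' | evaluate_checks "cache"
```

Passing the approved list explicitly keeps the exception visible in the incident notes instead of hiding it inside the script.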
Related Artifacts
- Backup/restore drill script: `scripts/ops/backup-restore-drill.sh`
- DR readiness checklist: `docs/operations/checklists/dr-readiness.md`
- Alert rules: `deploy/monitoring/prometheus-rules.yaml`, `deploy/monitoring/loki-alert-rules.yaml`