Disaster Recovery Playbook

This runbook defines recovery targets (RPO/RTO) and step-by-step actions for region, database, and provider incidents.

RPO / RTO Targets

| Component | RPO Target | RTO Target | Notes |
| --- | --- | --- | --- |
| Temporal state/history | <= 5 minutes | <= 30 minutes | Depends on Temporal replication topology |
| Postgres (tenant, registry, quotas, audit) | <= 5 minutes | <= 30 minutes | Requires WAL archival + tested restore |
| Dragonfly cache | <= 1 hour | <= 15 minutes | Treat as rebuildable unless strict persistence is configured |
| Tool registry exports | <= 24 hours | <= 15 minutes | Can be restored with backup registry-import |
| Audit archive objects | <= 24 hours | <= 1 hour | S3/object-store durability assumed |

Incident Types

  • Region outage
  • Postgres corruption or accidental data loss
  • Temporal cluster outage
  • LLM provider prolonged outage

Recovery Procedure

  1. Detect and declare incident.
  2. Freeze risky mutating operations (tenant writes, registry imports, bulk updates).
  3. Capture current status snapshots (see the snapshot sketch after this list):
    • /api/health
    • /api/health/detail
    • Temporal cluster and namespace health
  4. Restore dependencies in order:
    • Postgres
    • Temporal
    • Workers/UI
  5. Verify tenant isolation, quotas, and audit integrity.
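
A minimal snapshot-capture sketch for step 3, assuming the temporal CLI is configured; CRUVERO_BASE_URL and the cruvero namespace are placeholders to adapt:

#!/usr/bin/env bash
set -euo pipefail

BASE_URL="${CRUVERO_BASE_URL:-https://cruvero.example.com}"
SNAP_DIR="/tmp/incident-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$SNAP_DIR"

# API health snapshots
curl -fsS "$BASE_URL/api/health" > "$SNAP_DIR/health.json"
curl -fsS "$BASE_URL/api/health/detail" > "$SNAP_DIR/health-detail.json"

# Temporal cluster and namespace health
temporal operator cluster health > "$SNAP_DIR/temporal-cluster.txt"
temporal operator namespace describe cruvero > "$SNAP_DIR/temporal-namespace.txt"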

Postgres PITR Recovery

  1. Provision replacement Postgres cluster.
  2. Restore latest base backup.
  3. Replay WAL to the target timestamp (see the sketch after this list).
  4. Run migrations: go run ./cmd/migrate --cmd up.
  5. Validate row counts for critical tables:
    • tenants
    • tool_registries
    • audit_events
    • tenant_usage_daily
  6. Switch application DSN and restart workloads.
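
An illustrative PITR sketch for steps 2 through 5, assuming PostgreSQL 12+; the WAL archive path, target time, and DATABASE_URL are placeholders:

#!/usr/bin/env bash
set -euo pipefail

PGDATA=/var/lib/postgresql/data
TARGET_TIME="2024-01-01 12:00:00+00"   # placeholder recovery target

# Base backup already restored into $PGDATA; configure WAL replay to the target
cat >> "$PGDATA/postgresql.conf" <<EOF
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"
pg_ctl -D "$PGDATA" start

# After promotion: validate row counts for the critical tables
for t in tenants tool_registries audit_events tenant_usage_daily; do
  psql "$DATABASE_URL" -Atc "SELECT '$t', count(*) FROM $t;"
done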

Temporal Recovery

  1. Confirm cluster membership and frontend availability.
  2. Validate namespace registration and replication status.
  3. Restart workers once Temporal health is green.
  4. Reconcile stuck workflows (see the sketch after this list):
    • Query running workflows.
    • Terminate or retry long-stuck runs per policy.
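
A reconciliation sketch for step 4, assuming the temporal CLI with advanced visibility enabled; the namespace, cutoff, and workflow ID are placeholders:

#!/usr/bin/env bash
set -euo pipefail

NAMESPACE=cruvero                # placeholder
CUTOFF="2024-01-01T00:00:00Z"    # placeholder: start of the incident window

# Find workflows still running from before the incident
temporal workflow list --namespace "$NAMESPACE" \
  --query "ExecutionStatus='Running' AND StartTime < '$CUTOFF'"

# Terminate a long-stuck run once policy allows (prefer retry where safe)
temporal workflow terminate --namespace "$NAMESPACE" \
  --workflow-id "<stuck-workflow-id>" \
  --reason "DR reconciliation: stuck past incident cutoff"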

Tool Registry Recovery

  1. Restore registries from backup JSON (a newest-file helper is sketched after this list):
    • go run ./cmd/backup registry-import --in backups/tool-registries-*.json
  2. Verify hash immutability checks pass.
  3. Re-seed default/global registry if required.
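
A small helper for step 1 that imports the newest export rather than a hand-picked file (same command and glob as above):

#!/usr/bin/env bash
set -euo pipefail

# Pick the most recent registry export and import it
latest=$(ls -1t backups/tool-registries-*.json | head -n 1)
echo "Importing $latest"
go run ./cmd/backup registry-import --in "$latest"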

Audit Recovery and Verification

  1. Restore archive objects from object store if needed.
  2. If audit_events was restored, verify each tenant's hash chain via /api/audit/verify (see the loop after this list).
  3. Confirm no cross-tenant leakage in audit reads.
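
A per-tenant verification loop for step 2; the tenant-id source and the query parameter on /api/audit/verify are assumptions to adapt to the actual API:

#!/usr/bin/env bash
set -euo pipefail

BASE_URL="${CRUVERO_BASE_URL:-https://cruvero.example.com}"

# Verify each tenant's hash chain; report failures instead of stopping
for tenant in $(psql "$DATABASE_URL" -Atc "SELECT id FROM tenants;"); do
  curl -fsS "$BASE_URL/api/audit/verify?tenant_id=$tenant" \
    || echo "VERIFY FAILED: $tenant"
done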

DR Test Script (Quarterly)

Run a controlled simulation in staging:

#!/usr/bin/env bash
set -euo pipefail

# 1) Scale workers down in primary region
kubectl -n cruvero scale deploy/cruvero-worker --replicas=0

# 2) Validate failover region serves traffic
curl -fsS https://cruvero.example.com/healthz

# 3) Run backup roundtrip test
go run ./cmd/backup dump --out /tmp/dr-test.dump
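# --clean presumably drops existing objects before restore; staging only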
go run ./cmd/backup restore --in /tmp/dr-test.dump --clean

# 4) Validate core health endpoints
curl -fsS https://cruvero.example.com/api/health
curl -fsS https://cruvero.example.com/api/health/detail

Post-Recovery Checklist

  • All critical checks report healthy or an approved degraded state.
  • Workflow execution and task queue latency back within SLO.
  • Quota enforcement and audit logging confirmed working.
  • LLM failover chain reports expected provider status.
  • Incident notes completed with timeline and RCA owner assigned.

References

  • Backup/restore drill script: scripts/ops/backup-restore-drill.sh
  • DR readiness checklist: docs/operations/checklists/dr-readiness.md
  • Alert rules: deploy/monitoring/prometheus-rules.yaml, deploy/monitoring/loki-alert-rules.yaml