Disaster Recovery Playbook

This runbook defines recovery targets (RPO/RTO) and step-by-step actions for region, database, and provider incidents.

RPO / RTO Targets

| Component | RPO Target | RTO Target | Notes |
| --- | --- | --- | --- |
| Temporal state/history | <= 5 minutes | <= 30 minutes | Depends on Temporal replication topology |
| Postgres (tenant, registry, quotas, audit) | <= 5 minutes | <= 30 minutes | Requires WAL archival + tested restore |
| Dragonfly cache | <= 1 hour | <= 15 minutes | Treat as rebuildable unless strict persistence is configured |
| Tool registry exports | <= 24 hours | <= 15 minutes | Can be restored with backup registry-import |
| Audit archive objects | <= 24 hours | <= 1 hour | S3/object-store durability assumed |

Incident Types

  • Region outage
  • Postgres corruption or accidental data loss
  • Temporal cluster outage
  • LLM provider prolonged outage

Recovery Procedure

  1. Detect and declare incident.
  2. Freeze risky mutating operations (tenant writes, registry imports, bulk updates).
  3. Capture current status snapshots (see the snapshot sketch after this list):
    • /api/health
    • /api/health/detail
    • Temporal cluster and namespace health
  4. Restore dependencies in order:
    • Postgres
    • Temporal
    • Workers/UI
  5. Verify tenant isolation, quotas, and audit integrity.
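
A minimal snapshot-capture sketch for step 3, assuming the temporal CLI is configured; CRUVERO_BASE_URL and the cruvero namespace are placeholders to adapt:

#!/usr/bin/env bash
set -euo pipefail

BASE_URL="${CRUVERO_BASE_URL:-https://cruvero.example.com}"
SNAP_DIR="/tmp/incident-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$SNAP_DIR"

# API health snapshots
curl -fsS "$BASE_URL/api/health" > "$SNAP_DIR/health.json"
curl -fsS "$BASE_URL/api/health/detail" > "$SNAP_DIR/health-detail.json"

# Temporal cluster and namespace health
temporal operator cluster health > "$SNAP_DIR/temporal-cluster.txt"
temporal operator namespace describe cruvero > "$SNAP_DIR/temporal-namespace.txt"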

Postgres PITR Recovery

  1. Provision replacement Postgres cluster.
  2. Restore latest base backup.
  3. Replay WAL to the target timestamp (see the sketch after this list).
  4. Run migrations: go run ./cmd/migrate --cmd up.
  5. Validate row counts for critical tables:
    • tenants
    • tool_registries
    • audit_events
    • tenant_usage_daily
  6. Switch application DSN and restart workloads.
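
An illustrative PITR sketch for steps 2 through 5, assuming PostgreSQL 12+; the WAL archive path, target time, and DATABASE_URL are placeholders:

#!/usr/bin/env bash
set -euo pipefail

PGDATA=/var/lib/postgresql/data
TARGET_TIME="2024-01-01 12:00:00+00"   # placeholder recovery target

# Base backup already restored into $PGDATA; configure WAL replay to the target
cat >> "$PGDATA/postgresql.conf" <<EOF
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"
pg_ctl -D "$PGDATA" start

# After promotion: validate row counts for the critical tables
for t in tenants tool_registries audit_events tenant_usage_daily; do
  psql "$DATABASE_URL" -Atc "SELECT '$t', count(*) FROM $t;"
done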

Temporal Recovery

  1. Confirm cluster membership and frontend availability.
  2. Validate namespace registration and replication status.
  3. Restart workers once Temporal health is green.
  4. Reconcile stuck workflows (see the sketch after this list):
    • Query running workflows.
    • Terminate or retry long-stuck runs per policy.
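
A reconciliation sketch for step 4, assuming the temporal CLI with advanced visibility enabled; the namespace, cutoff, and workflow ID are placeholders:

#!/usr/bin/env bash
set -euo pipefail

NAMESPACE=cruvero                # placeholder
CUTOFF="2024-01-01T00:00:00Z"    # placeholder: start of the incident window

# Find workflows still running from before the incident
temporal workflow list --namespace "$NAMESPACE" \
  --query "ExecutionStatus='Running' AND StartTime < '$CUTOFF'"

# Terminate a long-stuck run once policy allows (prefer retry where safe)
temporal workflow terminate --namespace "$NAMESPACE" \
  --workflow-id "<stuck-workflow-id>" \
  --reason "DR reconciliation: stuck past incident cutoff"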

Tool Registry Recovery

  1. Restore registries from backup JSON (a newest-file helper is sketched after this list):
    • go run ./cmd/backup registry-import --in backups/tool-registries-*.json
  2. Verify hash immutability checks pass.
  3. Re-seed default/global registry if required.
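
A small helper for step 1 that imports the newest export rather than a hand-picked file (same command and glob as above):

#!/usr/bin/env bash
set -euo pipefail

# Pick the most recent registry export and import it
latest=$(ls -1t backups/tool-registries-*.json | head -n 1)
echo "Importing $latest"
go run ./cmd/backup registry-import --in "$latest"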

Audit Recovery and Verification

  1. Restore archive objects from object store if needed.
  2. If audit_events was restored, verify each tenant's hash chain via /api/audit/verify (see the loop after this list).
  3. Confirm no cross-tenant leakage in audit reads.
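
A per-tenant verification loop for step 2; the tenant-id source and the query parameter on /api/audit/verify are assumptions to adapt to the actual API:

#!/usr/bin/env bash
set -euo pipefail

BASE_URL="${CRUVERO_BASE_URL:-https://cruvero.example.com}"

# Verify each tenant's hash chain; report failures instead of stopping
for tenant in $(psql "$DATABASE_URL" -Atc "SELECT id FROM tenants;"); do
  curl -fsS "$BASE_URL/api/audit/verify?tenant_id=$tenant" \
    || echo "VERIFY FAILED: $tenant"
done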

DR Test Script (Quarterly)

Run a controlled simulation in staging:

#!/usr/bin/env bash
set -euo pipefail

# 1) Scale workers down in primary region
kubectl -n cruvero scale deploy/cruvero-worker --replicas=0

# 2) Validate failover region serves traffic
curl -fsS https://cruvero.example.com/healthz

# 3) Run backup roundtrip test
go run ./cmd/backup dump --out /tmp/dr-test.dump
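# --clean presumably drops existing objects before restore; staging only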
go run ./cmd/backup restore --in /tmp/dr-test.dump --clean

# 4) Validate core health endpoints
curl -fsS https://cruvero.example.com/api/health
curl -fsS https://cruvero.example.com/api/health/detail

Post-Recovery Checklist

  • All critical checks report healthy or an approved degraded state.
  • Workflow execution and task queue latency back within SLO.
  • Quota enforcement and audit logging confirmed working.
  • LLM failover chain reports expected provider status.
  • Incident notes completed with timeline and RCA owner assigned.

References

  • Backup/restore drill script: scripts/ops/backup-restore-drill.sh
  • DR readiness checklist: docs/operations/checklists/dr-readiness.md
  • Alert rules: deploy/monitoring/prometheus-rules.yaml, deploy/monitoring/loki-alert-rules.yaml