HA Deployment Guide
This guide describes a production multi-region deployment for Cruvero workers, UI, Temporal, Postgres, and Dragonfly.
Topology
- Region A and Region B each run:
  - `cruvero-worker` (active-active)
  - `cruvero-ui` (active-active)
  - local Dragonfly (optional cache/rate-limit tier)
- Temporal is deployed as either:
  - Temporal Cloud namespace with multi-region failover, or
  - self-hosted multi-cluster Temporal with namespace replication.
- Postgres is deployed with HA (Patroni or managed HA service) and cross-region replication.
- Global traffic manager (Cloudflare/Route53/GCLB) routes UI and worker traffic by health.
Temporal HA
- Use one namespace per tenant if strict isolation is needed.
- Configure namespace replication and failover priorities across clusters.
- Ensure history/matching/frontend services are spread across zones.
- Monitor replication lag and namespace failover events.
Recommended SLO targets:
- Workflow task schedule-to-start p95: < 2s
- Activity schedule-to-start p95: < 3s
- Namespace replication lag p95: < 30s
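
As an illustration of how the targets above might be checked against measured values, here is a minimal Python sketch. The metric keys and the `slo_violations` helper are hypothetical names chosen for this example; they are not part of Cruvero or Temporal, and the measured values would come from whatever monitoring stack is in use.

```python
# SLO targets from the list above, expressed as p95 values in seconds.
# The dictionary keys are illustrative names, not real metric identifiers.
SLO_TARGETS_S = {
    "workflow_task_schedule_to_start": 2.0,
    "activity_schedule_to_start": 3.0,
    "namespace_replication_lag": 30.0,
}


def slo_violations(measured_p95_s):
    """Return {metric: (measured, target)} for every metric that misses its target.

    The targets are strict upper bounds ("< 2s"), so a measured value equal to
    the target counts as a violation. A metric missing from measured_p95_s is
    reported with measured=None so a monitoring gap is not mistaken for a pass.
    """
    violations = {}
    for metric, target in SLO_TARGETS_S.items():
        measured = measured_p95_s.get(metric)
        if measured is None or measured >= target:
            violations[metric] = (measured, target)
    return violations


if __name__ == "__main__":
    sample = {
        "workflow_task_schedule_to_start": 1.2,
        "activity_schedule_to_start": 3.4,
        "namespace_replication_lag": 12.0,
    }
    for metric, (measured, target) in slo_violations(sample).items():
        print(f"{metric}: measured {measured}s, target < {target}s")
```

A check like this could run in a game-day script or a CI gate; alerting itself is better left to the monitoring rules.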
Postgres HA
- Primary + synchronous standby in each region where possible.
- Cross-region async replica for DR.
- Use connection pooling (PgBouncer) in front of database endpoints.
- Backups:
  - WAL archiving + daily base backup.
  - `cmd/backup dump` scheduled as defense-in-depth snapshot.
Dragonfly Strategy
- For strongest durability, keep quota/audit canonical data in Postgres.
- If Dragonfly is used for quota/rate acceleration, configure replicas and persistence snapshots.
- Treat Dragonfly as recoverable cache unless business policy requires strict persistence.
Worker Topology
- Active-active workers in both regions on the same Temporal task queues.
- Use Kubernetes anti-affinity + topology spread constraints.
- Keep per-pod limits aligned with expected tool-call concurrency.
- Configure LLM failover chain:
  - `CRUVERO_LLM_FAILOVER_CHAIN=openrouter,azure`
  - `CRUVERO_LLM_FAILOVER_THRESHOLD=3`
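
To make the failover chain settings concrete, the sketch below parses the two variables and models one plausible reading of them: providers are tried in the listed order, and a provider is skipped once it has accumulated `CRUVERO_LLM_FAILOVER_THRESHOLD` consecutive failures. That interpretation of the threshold is an assumption, not documented Cruvero behaviour, and `load_failover_config` and `FailoverChain` are hypothetical names used only for illustration.

```python
import os


def load_failover_config(env=None):
    """Parse the failover chain and threshold from environment-style variables."""
    env = os.environ if env is None else env
    raw_chain = env["CRUVERO_LLM_FAILOVER_CHAIN"]
    chain = [name.strip() for name in raw_chain.split(",") if name.strip()]
    threshold = int(env["CRUVERO_LLM_FAILOVER_THRESHOLD"])
    if not chain:
        raise ValueError("failover chain must list at least one provider")
    if threshold < 1:
        raise ValueError("failover threshold must be >= 1")
    return chain, threshold


class FailoverChain:
    """Tracks consecutive failures per provider (assumed semantics)."""

    def __init__(self, providers, threshold):
        self.providers = list(providers)
        self.threshold = threshold
        self.failures = {name: 0 for name in self.providers}

    def current(self):
        """Return the first provider still below the threshold, or None."""
        for name in self.providers:
            if self.failures[name] < self.threshold:
                return name
        return None

    def record(self, provider, ok):
        """Reset the counter on success, increment it on failure."""
        self.failures[provider] = 0 if ok else self.failures[provider] + 1


if __name__ == "__main__":
    chain, threshold = load_failover_config({
        "CRUVERO_LLM_FAILOVER_CHAIN": "openrouter,azure",
        "CRUVERO_LLM_FAILOVER_THRESHOLD": "3",
    })
    state = FailoverChain(chain, threshold)
    for _ in range(threshold):
        state.record("openrouter", ok=False)
    print(state.current())
```

In this model a later successful call (for example from a background probe) resets the counter and returns traffic to the preferred provider.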
DNS and Failover
- UI: health-based routing with `GET /healthz` and `GET /readyz`.
- Workers: if region unhealthy, scale down or remove from LB/mesh target sets.
- Recommended failover timing:
  - detection <= 30s
  - route convergence <= 60s
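
The timing targets above can be sanity-checked against health-check and DNS settings with simple arithmetic. The sketch below is a rough model under stated assumptions: the probe interval, probe timeout, unhealthy threshold, DNS TTL, and propagation delay are hypothetical parameters, and real traffic managers may schedule probes differently.

```python
# Budgets taken from the recommended failover timing above, in seconds.
DETECTION_BUDGET_S = 30
CONVERGENCE_BUDGET_S = 60


def detection_time_s(probe_interval_s, probe_timeout_s, unhealthy_threshold):
    """Conservative upper bound on the time to mark a region unhealthy.

    Assumes the region fails just after a successful probe and that each of
    the `unhealthy_threshold` failing probes waits a full interval and then
    runs into its timeout.
    """
    return unhealthy_threshold * (probe_interval_s + probe_timeout_s)


def convergence_time_s(dns_ttl_s, propagation_s):
    """Rough bound on route convergence: control-plane propagation plus the
    time for cached DNS answers to expire."""
    return dns_ttl_s + propagation_s


def within_budget(probe_interval_s, probe_timeout_s, unhealthy_threshold,
                  dns_ttl_s, propagation_s):
    """Return (detection_ok, convergence_ok) for the given settings."""
    detection = detection_time_s(probe_interval_s, probe_timeout_s, unhealthy_threshold)
    convergence = convergence_time_s(dns_ttl_s, propagation_s)
    return detection <= DETECTION_BUDGET_S, convergence <= CONVERGENCE_BUDGET_S


if __name__ == "__main__":
    # Example settings only: 5s interval, 2s timeout, 3 failures, 30s TTL, 20s propagation.
    print(detection_time_s(5, 2, 3), convergence_time_s(30, 20))
    print(within_budget(5, 2, 3, 30, 20))
```

With these example settings, detection is bounded by 21s and convergence by 50s, so both fit the targets; a 60s DNS TTL alone would consume the entire convergence budget.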
Latency Budgets
Budget cross-region call latency so total step latency remains predictable.
- Temporal API roundtrip: target < 150ms
- Postgres query p95: target < 100ms
- LLM provider call p95: target < 4s
- End-to-end agent step p95: target < 8s
If cross-region tool calls exceed budget:
- pin specific MCP tools to the same region as the workers,
- route outbound calls through regional egress,
- reduce synchronous tool calls per step.
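
One way to reason about these budgets is to add up the per-component targets for a step and see how much of the 8s end-to-end target is left for tool calls. The sketch below does that arithmetic; the call counts and tool latencies are hypothetical inputs, and summing p95 values is a deliberately conservative approximation, since the true p95 of a sum is usually lower than the sum of the p95s.

```python
# Per-call targets from the latency budget list above, in milliseconds.
TEMPORAL_ROUNDTRIP_MS = 150
POSTGRES_QUERY_MS = 100
LLM_CALL_MS = 4000
STEP_BUDGET_MS = 8000


def step_estimate_ms(temporal_roundtrips, postgres_queries, llm_calls, tool_call_ms=()):
    """Conservative estimate of one agent step, assuming every call is
    synchronous and lands at its p95 target."""
    return (
        temporal_roundtrips * TEMPORAL_ROUNDTRIP_MS
        + postgres_queries * POSTGRES_QUERY_MS
        + llm_calls * LLM_CALL_MS
        + sum(tool_call_ms)
    )


def remaining_tool_budget_ms(temporal_roundtrips, postgres_queries, llm_calls):
    """Milliseconds left for synchronous tool calls within the step budget."""
    fixed = step_estimate_ms(temporal_roundtrips, postgres_queries, llm_calls)
    return STEP_BUDGET_MS - fixed


if __name__ == "__main__":
    # Example step: 4 Temporal roundtrips, 5 queries, 1 LLM call.
    print(remaining_tool_budget_ms(4, 5, 1))
    # Three cross-region tool calls at example latencies.
    total = step_estimate_ms(4, 5, 1, tool_call_ms=[1200, 900, 1100])
    print(total, total <= STEP_BUDGET_MS)
```

In this example the fixed components use 5100ms, leaving 2900ms for tools; three tool calls totalling 3200ms push the step to 8300ms, which is the situation the mitigations above are meant to address.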
Deployment Sequence
- Apply `deploy/kubernetes/migration-job.yaml`.
- Wait for migration job completion.
- Deploy worker and UI manifests.
- Apply `hpa.yaml`, `pdb.yaml`, and `network-policy.yaml`.
- Verify:
  - `GET /healthz` and `GET /readyz` healthy for UI and workers.
  - Temporal namespace reachable from both regions.
  - LLM failover health details visible in `/api/health/detail`.
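
The health-endpoint part of the verification step could be scripted along the following lines. This is a sketch, not a supported tool: it assumes that an HTTP 200 response means healthy, the base URLs in the usage note are placeholders, and `http_status` and `verify_endpoints` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code for GET url, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def verify_endpoints(base_urls, paths=("/healthz", "/readyz"), fetch=http_status):
    """Check every path on every base URL and return a list of (url, status)
    pairs for endpoints that did not answer 200 (assumed to mean healthy)."""
    failures = []
    for base in base_urls:
        for path in paths:
            url = base.rstrip("/") + path
            status = fetch(url)
            if status != 200:
                failures.append((url, status))
    return failures
```

Calling `verify_endpoints(["https://ui.region-a.example", "https://ui.region-b.example"])` with real base URLs in place of these placeholders returns an empty list when every endpoint is healthy. The `fetch` parameter allows a stub to be injected for testing without network access.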
Related Artifacts
- Alert rules: `deploy/monitoring/prometheus-rules.yaml`, `deploy/monitoring/loki-alert-rules.yaml`
- HA game-day script: `scripts/ops/ha-failover-game-day.sh`
- Security posture checklist: `docs/operations/checklists/security-posture.md`