HA Deployment Guide

This guide describes a production multi-region deployment for Cruvero workers, UI, Temporal, Postgres, and Dragonfly.

Topology

  • Region A and Region B each run:
    • cruvero-worker (active-active)
    • cruvero-ui (active-active)
    • local Dragonfly (optional cache/rate-limit tier)
  • Temporal is deployed as either:
    • Temporal Cloud namespace with multi-region failover, or
    • self-hosted multi-cluster Temporal with namespace replication.
  • Postgres is deployed with HA (Patroni or managed HA service) and cross-region replication.
  • Global traffic manager (Cloudflare/Route53/GCLB) routes UI and worker traffic by health.
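The topology above can be summarized as a health-based routing policy. The sketch below is provider-agnostic and purely illustrative — the keys are descriptive, not a real Cloudflare/Route53/GCLB schema, and the endpoints are placeholders:

```yaml
# Illustrative routing policy for the global traffic manager.
# Translate into your provider's schema (Cloudflare Load Balancer,
# Route53 failover records, or GCLB backend services).
ui:
  strategy: health-based            # route only to regions passing checks
  health_check:
    path: /healthz
    interval_seconds: 10
    unhealthy_threshold: 3          # ~30s detection, matching the failover target
  origins:
    - region: a
      endpoint: ui.region-a.example.internal
    - region: b
      endpoint: ui.region-b.example.internal
```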

Temporal HA

  1. Use one namespace per tenant if strict isolation is needed.
  2. Configure namespace replication and failover priorities across clusters.
  3. Ensure history/matching/frontend services are spread across zones.
  4. Monitor replication lag and namespace failover events.
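For the self-hosted multi-cluster option, namespace replication is driven by the server's `clusterMetadata` section. The sketch below follows Temporal's self-hosted config format; verify the exact field names against your Temporal release, and note the cluster names and addresses are assumptions:

```yaml
# Sketch of Temporal server config enabling global namespaces and
# registering both clusters for replication.
clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 10
  masterClusterName: "region-a"
  currentClusterName: "region-a"    # set to "region-b" on the other cluster
  clusterInformation:
    region-a:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "temporal-frontend.region-a.internal:7233"
    region-b:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "temporal-frontend.region-b.internal:7233"
```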

Recommended SLO targets:

  • Workflow task schedule-to-start p95: < 2s
  • Activity schedule-to-start p95: < 3s
  • Namespace replication lag p95: < 30s
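The SLO targets above can be encoded as Prometheus alerts. The metric name below assumes the Temporal SDK's default schedule-to-start histogram; confirm the exact name and unit your metrics pipeline emits before deploying:

```yaml
# Sketch of a Prometheus alert for the workflow task schedule-to-start SLO.
# temporal_workflow_task_schedule_to_start_latency_seconds is an assumed
# metric name; adapt to your SDK/metrics setup.
groups:
  - name: temporal-slo
    rules:
      - alert: WorkflowTaskScheduleToStartHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Workflow task schedule-to-start p95 above 2s"
```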

Postgres HA

  • Primary + synchronous standby in each region where possible.
  • Cross-region async replica for DR.
  • Use connection pooling (PgBouncer) in front of database endpoints.
  • Backups:
    • WAL archiving + daily base backup.
    • A scheduled cmd/backup dump as a defense-in-depth snapshot.
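A minimal sketch of the synchronous-standby and WAL-archiving setup under Patroni follows; check the keys against the Patroni documentation for your version, and note the `wal-g` archiver is an assumption (any archive_command works):

```yaml
# Sketch of Patroni DCS settings: synchronous replication plus WAL archiving.
bootstrap:
  dcs:
    synchronous_mode: true          # require a sync standby before acking commits
    postgresql:
      parameters:
        wal_level: replica
        archive_mode: "on"
        archive_command: "wal-g wal-push %p"   # assumes wal-g; substitute your archiver
```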

Dragonfly Strategy

  • For strongest durability, keep quota/audit canonical data in Postgres.
  • If Dragonfly is used for quota/rate acceleration, configure replicas and persistence snapshots.
  • Treat Dragonfly as recoverable cache unless business policy requires strict persistence.
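If Dragonfly replicas and snapshots are enabled, the container args might look like the sketch below. The flag names (`--snapshot_cron`, `--replicaof`) and image tag should be checked against your Dragonfly release:

```yaml
# Sketch of Dragonfly container args for a replica with periodic snapshots,
# embedded in a Kubernetes pod spec.
containers:
  - name: dragonfly
    image: docker.dragonflydb.io/dragonflydb/dragonfly
    args:
      - "--dir=/data"
      - "--snapshot_cron=*/30 * * * *"                  # snapshot every 30 minutes
      - "--replicaof=dragonfly-primary.cache.svc:6379"  # on replicas only
    volumeMounts:
      - name: data
        mountPath: /data
```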

Worker Topology

  • Run active-active workers in both regions on the same Temporal task queues.
  • Use Kubernetes anti-affinity + topology spread constraints.
  • Keep per-pod limits aligned with expected tool-call concurrency.
  • Configure LLM failover chain:
    • CRUVERO_LLM_FAILOVER_CHAIN=openrouter,azure
    • CRUVERO_LLM_FAILOVER_THRESHOLD=3
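The scheduling and failover bullets above can be sketched as a pod spec fragment; the `app: cruvero-worker` label and container name are illustrative:

```yaml
# Sketch of worker pod scheduling (spread + anti-affinity) and the LLM
# failover environment from the bullets above.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: cruvero-worker
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: cruvero-worker
  containers:
    - name: worker
      env:
        - name: CRUVERO_LLM_FAILOVER_CHAIN
          value: "openrouter,azure"
        - name: CRUVERO_LLM_FAILOVER_THRESHOLD
          value: "3"
```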

DNS and Failover

  • UI: health-based routing with GET /healthz and GET /readyz.
  • Workers: if a region is unhealthy, scale its workers down or remove them from the LB/mesh target sets.
  • Recommended failover timing:
    • detection <= 30s
    • route convergence <= 60s
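Probe timing on the UI pods should line up with the detection target. A minimal sketch, assuming port 8080 (adapt to your UI service), where 3 failures at a 10s interval keeps detection at roughly 30s:

```yaml
# Sketch of UI probes tuned to the ~30s detection target.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080            # assumed port; match your UI container
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```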

Latency Budgets

Budget cross-region call latency so total step latency remains predictable.

  • Temporal API roundtrip: target < 150ms
  • Postgres query p95: target < 100ms
  • LLM provider call p95: target < 4s
  • End-to-end agent step p95: target < 8s

If cross-region tool calls exceed budget:

  • pin specific MCP tools to same region as workers,
  • route outbound calls through regional egress,
  • reduce synchronous tool calls per step.
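Region pinning for MCP tools might be expressed as a config fragment like the hypothetical sketch below; these keys are illustrative only, not a documented Cruvero schema, and the endpoints are placeholders:

```yaml
# Hypothetical sketch of pinning MCP tools to the workers' region.
mcp_tools:
  - name: search
    endpoint: https://mcp-search.region-a.example.internal
    region_pin: same-as-worker
  - name: browser
    endpoint: https://mcp-browser.region-a.example.internal
    region_pin: same-as-worker
```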

Deployment Sequence

  1. Apply deploy/kubernetes/migration-job.yaml.
  2. Wait for migration job completion.
  3. Deploy worker and UI manifests.
  4. Apply hpa.yaml, pdb.yaml, and network-policy.yaml.
  5. Verify:
    • GET /healthz and GET /readyz healthy for UI and workers.
    • Temporal namespace reachable from both regions.
    • LLM failover health details visible in /api/health/detail.
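For orientation, the migration Job referenced in step 1 has roughly this shape. The image, command, and secret name below are assumptions — adapt them to the real deploy/kubernetes/migration-job.yaml:

```yaml
# Minimal sketch of a schema-migration Job (step 1).
apiVersion: batch/v1
kind: Job
metadata:
  name: cruvero-migrate
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: cruvero/worker:latest        # assumed image
          command: ["cruvero", "migrate"]     # assumed migration entrypoint
          envFrom:
            - secretRef:
                name: cruvero-db-credentials  # assumed secret
```

Step 2 can then be `kubectl wait --for=condition=complete job/cruvero-migrate --timeout=300s`.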

Operational References

  • Alert rules: deploy/monitoring/prometheus-rules.yaml, deploy/monitoring/loki-alert-rules.yaml
  • HA game-day script: scripts/ops/ha-failover-game-day.sh
  • Security posture checklist: docs/operations/checklists/security-posture.md