HA Deployment Guide

This guide describes a production multi-region deployment for Cruvero workers, UI, Temporal, Postgres, and Dragonfly.

Topology

  • Region A and Region B each run:
    • cruvero-worker (active-active)
    • cruvero-ui (active-active)
    • local Dragonfly (optional cache/rate-limit tier)
  • Temporal is deployed as either:
    • Temporal Cloud namespace with multi-region failover, or
    • self-hosted multi-cluster Temporal with namespace replication.
  • Postgres is deployed with HA (Patroni or managed HA service) and cross-region replication.
  • Global traffic manager (Cloudflare/Route53/GCLB) routes UI and worker traffic by health.
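The topology above can be summarized as a health-based routing policy. The sketch below is provider-agnostic and purely illustrative — the keys are descriptive, not a real Cloudflare/Route53/GCLB schema, and the endpoints are placeholders:

```yaml
# Illustrative routing policy for the global traffic manager.
# Translate into your provider's schema (Cloudflare Load Balancer,
# Route53 failover records, or GCLB backend services).
ui:
  strategy: health-based            # route only to regions passing checks
  health_check:
    path: /healthz
    interval_seconds: 10
    unhealthy_threshold: 3          # ~30s detection, matching the failover target
  origins:
    - region: a
      endpoint: ui.region-a.example.internal
    - region: b
      endpoint: ui.region-b.example.internal
```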

Temporal HA

  1. Use one namespace per tenant if strict isolation is needed.
  2. Configure namespace replication and failover priorities across clusters.
  3. Ensure history/matching/frontend services are spread across zones.
  4. Monitor replication lag and namespace failover events.
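For the self-hosted multi-cluster option, namespace replication is driven by the server's `clusterMetadata` section. The sketch below follows Temporal's self-hosted config format; verify the exact field names against your Temporal release, and note the cluster names and addresses are assumptions:

```yaml
# Sketch of Temporal server config enabling global namespaces and
# registering both clusters for replication.
clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 10
  masterClusterName: "region-a"
  currentClusterName: "region-a"    # set to "region-b" on the other cluster
  clusterInformation:
    region-a:
      enabled: true
      initialFailoverVersion: 1
      rpcAddress: "temporal-frontend.region-a.internal:7233"
    region-b:
      enabled: true
      initialFailoverVersion: 2
      rpcAddress: "temporal-frontend.region-b.internal:7233"
```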

Recommended SLO targets:

  • Workflow task schedule-to-start p95: < 2s
  • Activity schedule-to-start p95: < 3s
  • Namespace replication lag p95: < 30s
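The SLO targets above can be encoded as Prometheus alerts. The metric name below assumes the Temporal SDK's default schedule-to-start histogram; confirm the exact name and unit your metrics pipeline emits before deploying:

```yaml
# Sketch of a Prometheus alert for the workflow task schedule-to-start SLO.
# temporal_workflow_task_schedule_to_start_latency_seconds is an assumed
# metric name; adapt to your SDK/metrics setup.
groups:
  - name: temporal-slo
    rules:
      - alert: WorkflowTaskScheduleToStartHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Workflow task schedule-to-start p95 above 2s"
```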

Postgres HA

  • Primary + synchronous standby in each region where possible.
  • Cross-region async replica for DR.
  • Use connection pooling (PgBouncer) in front of database endpoints.
  • Backups:
    • WAL archiving + daily base backup.
    • A scheduled cmd/backup dump as a defense-in-depth snapshot.
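A minimal sketch of the synchronous-standby and WAL-archiving setup under Patroni follows; check the keys against the Patroni documentation for your version, and note the `wal-g` archiver is an assumption (any archive_command works):

```yaml
# Sketch of Patroni DCS settings: synchronous replication plus WAL archiving.
bootstrap:
  dcs:
    synchronous_mode: true          # require a sync standby before acking commits
    postgresql:
      parameters:
        wal_level: replica
        archive_mode: "on"
        archive_command: "wal-g wal-push %p"   # assumes wal-g; substitute your archiver
```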

Dragonfly Strategy

  • For strongest durability, keep quota/audit canonical data in Postgres.
  • If Dragonfly is used for quota/rate acceleration, configure replicas and persistence snapshots.
  • Treat Dragonfly as recoverable cache unless business policy requires strict persistence.
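If Dragonfly replicas and snapshots are enabled, the container args might look like the sketch below. The flag names (`--snapshot_cron`, `--replicaof`) and image tag should be checked against your Dragonfly release:

```yaml
# Sketch of Dragonfly container args for a replica with periodic snapshots,
# embedded in a Kubernetes pod spec.
containers:
  - name: dragonfly
    image: docker.dragonflydb.io/dragonflydb/dragonfly
    args:
      - "--dir=/data"
      - "--snapshot_cron=*/30 * * * *"                  # snapshot every 30 minutes
      - "--replicaof=dragonfly-primary.cache.svc:6379"  # on replicas only
    volumeMounts:
      - name: data
        mountPath: /data
```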

Worker Topology

  • Run active-active workers in both regions on the same Temporal task queues.
  • Use Kubernetes anti-affinity + topology spread constraints.
  • Keep per-pod limits aligned with expected tool-call concurrency.
  • Configure LLM failover chain:
    • CRUVERO_LLM_FAILOVER_CHAIN=openrouter,azure
    • CRUVERO_LLM_FAILOVER_THRESHOLD=3
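The scheduling and failover bullets above can be sketched as a pod spec fragment; the `app: cruvero-worker` label and container name are illustrative:

```yaml
# Sketch of worker pod scheduling (spread + anti-affinity) and the LLM
# failover environment from the bullets above.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: cruvero-worker
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: cruvero-worker
  containers:
    - name: worker
      env:
        - name: CRUVERO_LLM_FAILOVER_CHAIN
          value: "openrouter,azure"
        - name: CRUVERO_LLM_FAILOVER_THRESHOLD
          value: "3"
```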

DNS and Failover

  • UI: health-based routing with GET /healthz and GET /readyz.
  • Workers: if a region is unhealthy, scale its workers down or remove them from the LB/mesh target sets.
  • Recommended failover timing:
    • detection <= 30s
    • route convergence <= 60s
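Probe timing on the UI pods should line up with the detection target. A minimal sketch, assuming port 8080 (adapt to your UI service), where 3 failures at a 10s interval keeps detection at roughly 30s:

```yaml
# Sketch of UI probes tuned to the ~30s detection target.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080            # assumed port; match your UI container
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```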

Latency Budgets

Budget cross-region call latency so total step latency remains predictable.

  • Temporal API roundtrip: target < 150ms
  • Postgres query p95: target < 100ms
  • LLM provider call p95: target < 4s
  • End-to-end agent step p95: target < 8s

If cross-region tool calls exceed budget:

  • pin specific MCP tools to same region as workers,
  • route outbound calls through regional egress,
  • reduce synchronous tool calls per step.
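Region pinning for MCP tools might be expressed as a config fragment like the hypothetical sketch below; these keys are illustrative only, not a documented Cruvero schema, and the endpoints are placeholders:

```yaml
# Hypothetical sketch of pinning MCP tools to the workers' region.
mcp_tools:
  - name: search
    endpoint: https://mcp-search.region-a.example.internal
    region_pin: same-as-worker
  - name: browser
    endpoint: https://mcp-browser.region-a.example.internal
    region_pin: same-as-worker
```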

Deployment Sequence

  1. Apply deploy/kubernetes/migration-job.yaml.
  2. Wait for migration job completion.
  3. Deploy worker and UI manifests.
  4. Apply hpa.yaml, pdb.yaml, and network-policy.yaml.
  5. Verify:
    • GET /healthz and GET /readyz healthy for UI and workers.
    • Temporal namespace reachable from both regions.
    • LLM failover health details visible in /api/health/detail.
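For orientation, the migration Job referenced in step 1 has roughly this shape. The image, command, and secret name below are assumptions — adapt them to the real deploy/kubernetes/migration-job.yaml:

```yaml
# Minimal sketch of a schema-migration Job (step 1).
apiVersion: batch/v1
kind: Job
metadata:
  name: cruvero-migrate
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: cruvero/worker:latest        # assumed image
          command: ["cruvero", "migrate"]     # assumed migration entrypoint
          envFrom:
            - secretRef:
                name: cruvero-db-credentials  # assumed secret
```

Step 2 can then be `kubectl wait --for=condition=complete job/cruvero-migrate --timeout=300s`.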

Operational References

  • Alert rules: deploy/monitoring/prometheus-rules.yaml, deploy/monitoring/loki-alert-rules.yaml
  • HA game-day script: scripts/ops/ha-failover-game-day.sh
  • Security posture checklist: docs/operations/checklists/security-posture.md