Scaling Runbook

When to use: Worker throughput is degraded, task queue depth is growing, or step latency exceeds SLO targets for 10+ minutes.

Prerequisites:

  • kubectl access to the target Kubernetes cluster
  • Prometheus/Grafana dashboards for Temporal and worker metrics
  • Permission to adjust HPA limits or node pool sizes

Trigger Conditions

A scaling review is required when one or more of the following conditions persist for 10+ minutes:

  • Temporal task queue depth is growing continuously.
  • Agent step latency p95 is above SLO (> 8s end-to-end).
  • Worker CPU > 70% or memory > 80% sustained.
  • Workflow schedule-to-start latency p95 exceeds target (> 2s).
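
These triggers can be confirmed directly against Prometheus. The queries below are a minimal sketch: they assume the Prometheus API is reachable at http://<prometheus>:9090 and that the queue-depth gauge and schedule-to-start histogram use the names shown; verify both against your dashboards.

  # Queue depth still growing? A positive derivative over the last 10 minutes means yes.
  curl -fsS 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=deriv(temporal_task_queue_depth[10m]) > 0'

  # p95 schedule-to-start latency over the last 10 minutes (histogram name assumed).
  curl -fsS 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_bucket[10m])) by (le))'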

Indicators and Thresholds

  • temporal_task_queue_depth: warn at >= 30, critical at >= 100.
  • Worker CPU utilization: warn at >= 70%, critical at >= 85%.
  • Worker memory utilization: warn at >= 80%, critical at >= 90%.
  • Workflow failure rate: warn at >= 3%, critical at >= 8%.
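
To spot-check current values against these thresholds, a minimal sketch follows. It assumes the worker pods carry an app=cruvero-worker label and that workflow completion/failure counters exist under the names shown; substitute your own.

  # Worker CPU and memory per pod (compare usage against the requests/limits behind the 70% / 80% thresholds).
  kubectl -n cruvero top pods -l app=cruvero-worker

  # Current task queue depth (warn >= 30, critical >= 100).
  curl -fsS 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=max(temporal_task_queue_depth)'

  # Workflow failure rate over the last 10 minutes (counter names assumed).
  curl -fsS 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(temporal_workflow_failed_total[10m])) / sum(rate(temporal_workflow_completed_total[10m]))'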

Step-by-Step Actions

  1. Confirm health baseline:
    • curl -fsS http://<ui>/api/health
    • curl -fsS http://<ui>/api/health/detail
  2. Identify which component is the bottleneck (example checks below):
    • worker pods saturated (CPU, memory, or no free task slots),
    • Temporal matching/history services under pressure,
    • Postgres connection or query saturation.
  3. Scale workers immediately if queue depth is rising:
    • kubectl -n cruvero scale deployment/cruvero-worker --replicas=<N>
  4. If HPA is enabled, tune its bounds/target instead, since an active HPA reverts manual replica changes (see the HPA sketch below):
    • edit deploy/kubernetes/hpa.yaml and raise maxReplicas,
    • adjust the CPU and/or queue-depth target.
  5. Temporal scaling (platform/SRE step; see the deployment-specific sketch below):
    • increase matching/history capacity,
    • add shards/partitions based on Temporal deployment model,
    • verify frontend saturation is not the limiter.
  6. Postgres scaling (see the connection checks below):
    • ensure PgBouncer pooling is in use,
    • increase pool safely,
    • route heavy reads to replicas where supported.
  7. Recheck queue drain and latency after each change before further scaling.
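
Example checks for step 2, as a sketch: it assumes the SDK worker metric temporal_worker_task_slots_available and the Temporal server latency histogram are scraped into the same Prometheus; exact names vary by SDK and server version.

  # Worker saturation: available task slots near zero means the workers themselves are the bottleneck (SDK metric name assumed).
  curl -fsS 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=min(temporal_worker_task_slots_available) by (task_queue)'

  # Temporal matching/history pressure: p95 request latency per service (server metric/label names assumed; cross-check the Temporal server dashboards).
  curl -fsS 'http://<prometheus>:9090/api/v1/query' \
    --data-urlencode 'query=histogram_quantile(0.95, sum(rate(service_latency_bucket[5m])) by (le, service_name))'

  # Postgres saturation: see the step 6 checks further below.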
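
For steps 3 and 4, a sketch of the manual scale-out and the HPA adjustment. The replica values are examples, and the HPA object is assumed to share the cruvero-worker name defined in deploy/kubernetes/hpa.yaml.

  # Manual scale-out; only effective while no HPA manages the deployment.
  kubectl -n cruvero scale deployment/cruvero-worker --replicas=<N>
  kubectl -n cruvero rollout status deployment/cruvero-worker

  # With HPA enabled, raise the ceiling instead (HPA name assumed).
  kubectl -n cruvero patch hpa cruvero-worker --type merge -p '{"spec": {"maxReplicas": 12}}'

  # Or edit maxReplicas and the metric targets in the manifest and re-apply it.
  kubectl apply -f deploy/kubernetes/hpa.yaml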
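
Steps 5 and 6 depend on how Temporal and Postgres are deployed. The sketch below assumes Temporal services run as separate deployments and that a PgBouncer pod fronts Postgres; every name in angle brackets is a placeholder.

  # Temporal: scale the matching/history services (deployment names depend on the install method; confirm them before scaling).
  kubectl -n <temporal-namespace> scale deployment/<temporal-matching> --replicas=<N>
  kubectl -n <temporal-namespace> scale deployment/<temporal-history> --replicas=<N>

  # PgBouncer: confirm pooling is in use and pools are not exhausted (requires psql in the PgBouncer image; 6432 is the default listen port).
  kubectl -n cruvero exec -it <pgbouncer-pod> -- psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"

  # Postgres: backend count versus max_connections.
  kubectl -n cruvero exec -it <postgres-pod> -- psql -U <db-user> -c \
    "SELECT count(*) AS backends, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"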

Verification

  • Task queue depth trending downward for 15+ minutes.
  • Worker utilization below warning thresholds.
  • Step latency p95 returns to target budget.
  • No new quota or timeout spikes in audit/metrics.
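
To watch the drain, the same Prometheus endpoint and metric-name assumptions as above apply:

  # Queue depth should trend downward: a sustained negative derivative over 15 minutes.
  watch -n 60 "curl -fsS 'http://<prometheus>:9090/api/v1/query' --data-urlencode 'query=deriv(temporal_task_queue_depth[15m])'"

  # Worker utilization should settle back under the warning thresholds.
  watch -n 60 kubectl -n cruvero top pods -l app=cruvero-worker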

Escalation Path

  • Escalate to Platform/SRE when:
    • scaling workers does not reduce queue depth,
    • Temporal services remain saturated after worker scale-out,
    • Postgres saturation persists after pool tuning.
  • Escalate to Application owners when:
    • the workload spike is caused by runaway workflow/tool behavior,
    • a tenant-specific abuse pattern needs policy mitigation.