Scaling Runbook
When to use: Worker throughput is degraded, task queue depth is growing, or step latency exceeds SLO targets for 10+ minutes.
Prerequisites:
- `kubectl` access to the target Kubernetes cluster
- Prometheus/Grafana dashboards for Temporal and worker metrics
- Permission to adjust HPA limits or node pool sizes
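A quick pre-check of the access prerequisites; the `cruvero` namespace is taken from the scale command later in this runbook and may differ in your cluster.
```bash
# Confirm cluster access and permission to scale workers / adjust the HPA.
kubectl auth can-i patch deployments -n cruvero
kubectl auth can-i patch horizontalpodautoscalers -n cruvero
kubectl -n cruvero get deploy,hpa
```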
Trigger Conditions
Scale review is required when one or more conditions persist for 10+ minutes:
- Temporal task queue depth is growing continuously.
- Agent step latency p95 is above SLO (> 8s end-to-end).
- Worker CPU > 70% or memory > 80% sustained.
- Workflow schedule-to-start latency p95 exceeds target (> 2s).
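To confirm the trigger conditions from raw metrics rather than a dashboard, queries like the sketch below work against the Prometheus HTTP API. The Prometheus URL and the schedule-to-start latency metric name are assumptions; `temporal_task_queue_depth` is the metric named in the indicators below.
```bash
# Spot-check trigger metrics via the Prometheus HTTP API.
PROM="http://prometheus.monitoring.svc:9090"   # placeholder; use your Prometheus endpoint

# Current task queue depth (metric name from the indicators section).
curl -fsS "$PROM/api/v1/query" \
  --data-urlencode 'query=max(temporal_task_queue_depth)'

# p95 schedule-to-start latency over the last 10 minutes (metric name is an assumption).
curl -fsS "$PROM/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(temporal_workflow_task_schedule_to_start_latency_bucket[10m])) by (le))'
```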
Indicators and Thresholds
- `temporal_task_queue_depth`: warn at >= 30, critical at >= 100.
- Worker CPU utilization: warn at >= 70%, critical at >= 85%.
- Worker memory utilization: warn at >= 80%, critical at >= 90%.
- Workflow failure rate: warn at >= 3%, critical at >= 8%.
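For a quick warn/critical readout against the thresholds above, a small shell helper like this sketch can be used during an incident; it assumes `jq` is installed and reuses the placeholder Prometheus endpoint from the previous snippet.
```bash
# Compare an instant Prometheus query result against warn/critical thresholds.
PROM="http://prometheus.monitoring.svc:9090"   # placeholder

check() {
  local name="$1" query="$2" warn="$3" crit="$4" value
  value=$(curl -fsS "$PROM/api/v1/query" --data-urlencode "query=$query" \
    | jq -r '.data.result[0].value[1] // "0"')
  if awk -v v="$value" -v t="$crit" 'BEGIN { exit !(v >= t) }'; then
    echo "CRITICAL $name=$value (threshold $crit)"
  elif awk -v v="$value" -v t="$warn" 'BEGIN { exit !(v >= t) }'; then
    echo "WARN $name=$value (threshold $warn)"
  else
    echo "OK $name=$value"
  fi
}

check queue_depth 'max(temporal_task_queue_depth)' 30 100
# Add further checks with your own CPU/memory/failure-rate queries, e.g.:
# check worker_cpu_pct '<worker CPU utilization query, in percent>' 70 85
```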
Step-by-Step Actions
- Confirm health baseline:
curl -fsS http://<ui>/api/health
curl -fsS http://<ui>/api/health/detail
- Validate the bottleneck component (triage checks sketched below):
- workers saturated,
- Temporal matching/history pressure,
- Postgres connection or query saturation.
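One hedged triage pass over the three candidates above; the pod name filters, the Temporal namespace, and the Postgres host/user/database are assumptions about this deployment.
```bash
# Worker saturation: CPU/memory of worker pods (requires metrics-server).
kubectl -n cruvero top pods | grep cruvero-worker

# Temporal matching/history pressure: resource usage and restarts of Temporal pods.
kubectl -n cruvero top pods | grep -i temporal
kubectl -n cruvero get pods | grep -i temporal

# Postgres connection saturation (host/user/database are placeholders).
psql -h <postgres-host> -U <user> -d postgres -c \
  "SELECT state, count(*) AS n FROM pg_stat_activity GROUP BY state ORDER BY n DESC;"
```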
- Scale workers immediately if queue depth is rising:
kubectl -n cruvero scale deployment/cruvero-worker --replicas=<N>
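After scaling, confirm the new replicas actually become Ready before moving on (same namespace and deployment as the command above):
```bash
# Wait for the scale-out to complete and confirm the ready replica count.
kubectl -n cruvero rollout status deployment/cruvero-worker
kubectl -n cruvero get deployment cruvero-worker
```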
- If HPA is enabled, tune its bounds/target (example commands below):
- edit `deploy/kubernetes/hpa.yaml` and raise `maxReplicas`,
- adjust the CPU and/or queue-depth target.
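A sketch of applying and verifying the HPA change; the HPA object name `cruvero-worker` is an assumption, so confirm it with the first command.
```bash
# Inspect current HPA bounds/targets, then apply the edited manifest.
kubectl -n cruvero get hpa
kubectl apply -f deploy/kubernetes/hpa.yaml

# Or raise maxReplicas directly (HPA name is an assumption; 20 is an example value).
kubectl -n cruvero patch hpa cruvero-worker --type merge -p '{"spec":{"maxReplicas":20}}'
```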
- Temporal scaling (platform/SRE step; example commands below):
- increase matching/history capacity,
- add shards/partitions based on Temporal deployment model,
- verify frontend saturation is not the limiter.
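If Temporal is self-hosted in-cluster, the sketch below is one way to check and scale individual services; the deployment names and namespace are assumptions about your Temporal deployment model.
```bash
# Per-service resource usage (deployment names are assumptions).
kubectl -n cruvero top pods | grep -Ei 'temporal-(frontend|matching|history)'

# Scale the saturated service once the real deployment name is confirmed.
kubectl -n cruvero scale deployment/temporal-matching --replicas=<N>
kubectl -n cruvero rollout status deployment/temporal-matching
```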
- Postgres scaling (checks sketched below):
- ensure PgBouncer pooling is in use,
- increase pool safely,
- route heavy reads to replicas where supported.
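Hedged checks for the Postgres step; hosts, ports, users, and databases are placeholders, and `SHOW POOLS` requires access to the PgBouncer admin console.
```bash
# Connection count vs. configured max on Postgres.
psql -h <postgres-host> -U <user> -d postgres -c \
  "SELECT count(*) AS connections,
          (SELECT setting FROM pg_settings WHERE name = 'max_connections') AS max_connections
     FROM pg_stat_activity;"

# Pool utilization via the PgBouncer admin console (6432 is the conventional port).
psql -h <pgbouncer-host> -p 6432 -U <admin-user> -d pgbouncer -c "SHOW POOLS;"
```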
- Recheck queue drain and latency after each change before further scaling.
Verification
- Task queue depth trending downward for 15+ minutes.
- Worker utilization below warning thresholds.
- Step latency p95 returns to target budget.
- No new quota or timeout spikes in audit/metrics.
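To verify the drain trend over the 15-minute window, a minimal sampling loop like this works; the Prometheus endpoint is the same placeholder used earlier and `jq` is assumed to be installed.
```bash
# Sample queue depth once a minute for 15 minutes; values should trend downward.
PROM="http://prometheus.monitoring.svc:9090"   # placeholder
for i in $(seq 1 15); do
  printf '%s ' "$(date +%H:%M:%S)"
  curl -fsS "$PROM/api/v1/query" \
    --data-urlencode 'query=max(temporal_task_queue_depth)' \
    | jq -r '.data.result[0].value[1] // "no data"'
  sleep 60
done
```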
Escalation Path
- Escalate to Platform/SRE when:
- scaling workers does not reduce queue depth,
- Temporal services remain saturated after worker scale-out,
- Postgres saturation persists after pool tuning.
- Escalate to Application owners when:
- workload spike caused by runaway workflow/tool behavior,
- tenant-specific abusive pattern needs policy mitigation.