Alerting
Cruvero ships Prometheus and Loki alert rules as code for monitoring runtime health, quota enforcement, audit integrity, LLM availability, and security signals. This document describes each alert category, default thresholds, and recommended responses.
Prerequisites: Prometheus with Alertmanager, Loki (or a compatible log aggregation stack), and Grafana (optional, for dashboards).
Alert Rule Sources
Alert rules are provided as code in:
- deploy/monitoring/prometheus-rules.yaml — Metric-based alerts (counters, gauges, histograms)
- deploy/monitoring/loki-alert-rules.yaml — Log-based alerts (structured log patterns)
- deploy/monitoring/README.md — Setup instructions and label conventions
Alert Categories
Health Probe Failures
What it monitors: Worker and UI health endpoints returning non-200 responses or timing out.
Default threshold: 3 consecutive probe failures within the check interval (CRUVERO_HEALTH_CHECK_INTERVAL, default 30s).
Recommended response: Check worker logs for startup errors or database connectivity issues. See Incident Response.
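As a reference point, a metric-based rule for this category might look like the sketch below. The metric name cruvero_health_probe_failures_total and its labels are assumptions for illustration only; use the names actually defined in deploy/monitoring/prometheus-rules.yaml.

```yaml
groups:
  - name: cruvero-health
    rules:
      - alert: CruveroHealthProbeFailing
        # Hypothetical counter of failed health probes per component;
        # 3 failures mirrors the documented default threshold.
        expr: increase(cruvero_health_probe_failures_total[2m]) >= 3
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Health probes failing for {{ $labels.component }}"
          runbook_url: https://runbooks.example.com/cruvero/incident-response  # placeholder
```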
Quota Critical Threshold
What it monitors: Tenant usage approaching or exceeding configured quota limits (requests per minute, tokens per day, cost).
Default threshold: Warning at 80% (CRUVERO_QUOTA_WARNING_THRESHOLD), critical at 95% (CRUVERO_QUOTA_CRITICAL_THRESHOLD).
Recommended response: Review tenant usage patterns. Consider increasing quotas or enabling model downgrade (CRUVERO_QUOTA_DOWNGRADE_MODEL). See Scaling.
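A fragment of such a rule (one entry in a group's rules: list) is sketched below, assuming a hypothetical per-tenant utilization gauge scaled 0 to 1. Only the critical rule is shown; the warning rule would be identical with a 0.80 threshold and severity: warning.

```yaml
- alert: CruveroTenantQuotaCritical
  # Hypothetical gauge of quota utilization per tenant and dimension
  # (requests, tokens, cost); 0.95 mirrors CRUVERO_QUOTA_CRITICAL_THRESHOLD.
  expr: max by (tenant, dimension) (cruvero_quota_utilization_ratio) > 0.95
  for: 5m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Tenant {{ $labels.tenant }} is above 95% of its {{ $labels.dimension }} quota"
```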
Audit Writer Backpressure
What it monitors: Audit buffer reaching capacity, indicating the audit writer cannot keep up with event volume.
Default threshold: Buffer utilization above 80% of CRUVERO_AUDIT_BUFFER_SIZE (default 50).
Recommended response: Check Postgres write latency for the audit database. Consider increasing buffer size or using a dedicated audit DSN (CRUVERO_AUDIT_POSTGRES_URL).
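A minimal sketch, assuming the worker exports buffer depth and capacity as gauges (both metric names are placeholders):

```yaml
- alert: CruveroAuditBufferBackpressure
  # Fires when the audit buffer sits above 80% of CRUVERO_AUDIT_BUFFER_SIZE;
  # the metric names here are assumptions for illustration.
  expr: cruvero_audit_buffer_depth / cruvero_audit_buffer_capacity > 0.80
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Audit writer buffer above 80% capacity; events may be delayed"
```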
LLM Failover Churn
What it monitors: Frequent provider failovers indicating instability in the LLM provider chain.
Default threshold: More than CRUVERO_LLM_FAILOVER_THRESHOLD (default 3) failovers within the recovery interval.
Recommended response: Check provider status pages. Review failover chain order (CRUVERO_LLM_FAILOVER_CHAIN). Consider adjusting latency threshold or recovery interval. See Model Rotation.
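A sketch of such a rule, assuming a hypothetical failover counter; the 15m window stands in for the recovery interval and should match your configuration:

```yaml
- alert: CruveroLLMFailoverChurn
  # More than 3 failovers (the documented CRUVERO_LLM_FAILOVER_THRESHOLD default)
  # within the window; cruvero_llm_failover_total is a placeholder name.
  expr: sum by (provider) (increase(cruvero_llm_failover_total[15m])) > 3
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Repeated failovers away from provider {{ $labels.provider }}"
```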
Security Signal Spikes
What it monitors: Elevated rates of security events:
- network_policy_denied — Network policy blocking outbound requests
- output_filter_blocked — Output filter redacting or blocking responses
- injection_detected_total — Input sanitization detecting injection attempts
Default threshold: Configurable delta thresholds in immune alert settings.
Recommended response: Investigate the source tenant and workflow. Check for prompt injection patterns or misconfigured tools. See Security Incident.
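For example, a spike rule over the injection counter named above might look like the following. The 5m window, the threshold of 10, and the tenant label are placeholders to tune against your baseline.

```yaml
- alert: CruveroInjectionAttemptSpike
  # injection_detected_total is the counter referenced in this section;
  # its exact label set is an assumption here.
  expr: sum by (tenant) (increase(injection_detected_total[5m])) > 10
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Spike in detected injection attempts for tenant {{ $labels.tenant }}"
```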
Applying Alert Rules
Adjust label selectors and metric names to match your environment, then apply:
kubectl apply -f deploy/monitoring/prometheus-rules.yaml
kubectl apply -f deploy/monitoring/loki-alert-rules.yaml
Customizing Thresholds
Alert thresholds can be customized by editing the rule YAML files directly. Key values to tune:
- Probe intervals — Adjust the for: duration in health probe rules
- Quota thresholds — Modify percentage thresholds in quota rules to match your SLAs
- Failover sensitivity — Adjust the failover count threshold based on your provider reliability
- Security baselines — Set delta thresholds based on your normal traffic patterns
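For instance, in a metric rule the usual knobs are the comparison value inside expr, the for: duration, and the severity label. The fragment below reuses the hypothetical health-probe metric from the earlier sketch; it is illustrative, not the shipped rule.

```yaml
- alert: CruveroHealthProbeFailing
  expr: increase(cruvero_health_probe_failures_total[2m]) >= 3  # tolerated failure count
  for: 2m                 # lengthen to reduce flapping, shorten to page sooner
  labels:
    severity: critical    # downgrade to warning for non-critical components
    team: platform
```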
Adding New Alerts
To add a custom alert:
- Add the rule to the appropriate YAML file (prometheus-rules.yaml for metrics, loki-alert-rules.yaml for logs).
- Follow the existing label conventions (severity: warning|critical, team: platform).
- Include an annotations.runbook_url pointing to the relevant runbook.
- Test the rule with promtool check rules prometheus-rules.yaml.
- Apply and verify in a staging environment before production.
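Putting those conventions together, a new metric-based rule might look like the sketch below. The alert name, expression, and runbook URL are placeholders; if your deployment applies rules through the Prometheus Operator, the same group goes under spec.groups of a PrometheusRule resource.

```yaml
groups:
  - name: cruvero-custom
    rules:
      - alert: CruveroWorkflowBacklogHigh
        # Placeholder expression; replace with a metric your deployment exports.
        expr: cruvero_workflow_queue_depth > 100
        for: 10m
        labels:
          severity: warning   # or critical, per the existing convention
          team: platform
        annotations:
          summary: "Workflow queue depth above 100 for 10 minutes"
          runbook_url: https://runbooks.example.com/cruvero/workflow-backlog  # placeholder
```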
Notes
- Prometheus rules assume the referenced counters and gauges are exported and scraped by your Prometheus instance.
- Loki rules work from existing structured log lines and can be used immediately where logs are centralized (see the sketch after these notes).
- All alerts should route through Alertmanager with appropriate silencing and escalation policies.
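As a sketch of a log-based rule: Loki's ruler accepts Prometheus-style rule files with LogQL expressions, so the shape mirrors the metric rules above. The {app="cruvero-worker"} stream selector, the json pipeline, and the thresholds are assumptions about how your logs are labeled.

```yaml
groups:
  - name: cruvero-logs
    rules:
      - alert: CruveroWorkerErrorBurst
        # Counts error-level structured log lines over 5 minutes; the stream
        # selector and the level field are assumptions about your log schema.
        expr: sum(count_over_time({app="cruvero-worker"} | json | level="error" [5m])) > 20
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Cruvero worker is emitting a burst of error-level logs"
```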