
Alerting

Cruvero ships Prometheus and Loki alert rules as code for monitoring runtime health, quota enforcement, audit integrity, LLM availability, and security signals. This document describes each alert category, default thresholds, and recommended responses.

Prerequisites: Prometheus with Alertmanager, Loki (or compatible log aggregation), Grafana (optional, for dashboards).

Alert Rule Sources

Alert rules are provided as code in:

  • deploy/monitoring/prometheus-rules.yaml — Metric-based alerts (counters, gauges, histograms)
  • deploy/monitoring/loki-alert-rules.yaml — Log-based alerts (structured log patterns)
  • deploy/monitoring/README.md — Setup instructions and label conventions

Alert Categories

Health Probe Failures

What it monitors: Worker and UI health endpoints returning non-200 responses or timing out.

Default threshold: 3 consecutive probe failures within the check interval (CRUVERO_HEALTH_CHECK_INTERVAL, default 30s).

Recommended response: Check worker logs for startup errors or database connectivity issues. See Incident Response.
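
A minimal Prometheus rule sketch for this alert, assuming a probe_success-style gauge is scraped for the worker and UI endpoints (the metric and job labels below are illustrative; substitute what your deployment actually exports). The for: duration of 90s approximates three missed 30-second checks:

groups:
  - name: cruvero-health
    rules:
      - alert: CruveroHealthProbeFailing
        # probe_success and the job label are assumptions; use your exporter's metric
        expr: 'probe_success{job=~"cruvero-(worker|ui)"} == 0'
        # 90s at a 30s check interval is roughly 3 consecutive failures
        for: 90s
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Health probe failing for {{ $labels.instance }}"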

Quota Critical Threshold

What it monitors: Tenant usage approaching or exceeding configured quota limits (requests per minute, tokens per day, cost).

Default threshold: Warning at 80% (CRUVERO_QUOTA_WARNING_THRESHOLD), critical at 95% (CRUVERO_QUOTA_CRITICAL_THRESHOLD).

Recommended response: Review tenant usage patterns. Consider increasing quotas or enabling model downgrade (CRUVERO_QUOTA_DOWNGRADE_MODEL). See Scaling.
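
As a sketch, the two thresholds map naturally to a warning rule and a critical rule. The metric name cruvero_quota_usage_ratio is an assumption; use whatever usage-to-limit ratio your deployment exports per tenant:

- alert: CruveroQuotaWarning
  # hypothetical gauge: tenant usage divided by its configured limit
  expr: cruvero_quota_usage_ratio > 0.80
  labels: {severity: warning, team: platform}
- alert: CruveroQuotaCritical
  expr: cruvero_quota_usage_ratio > 0.95
  labels: {severity: critical, team: platform}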

Audit Writer Backpressure

What it monitors: Audit buffer reaching capacity, indicating the audit writer cannot keep up with event volume.

Default threshold: Buffer utilization above 80% of CRUVERO_AUDIT_BUFFER_SIZE (default 50).

Recommended response: Check Postgres write latency for the audit database. Consider increasing buffer size or using a dedicated audit DSN (CRUVERO_AUDIT_POSTGRES_URL).
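
A rule-body sketch for this alert, assuming gauges exist for current buffer depth and configured capacity (both metric names are illustrative):

- alert: CruveroAuditBufferBackpressure
  # both metric names are assumptions; the ratio compares buffer depth to CRUVERO_AUDIT_BUFFER_SIZE
  expr: cruvero_audit_buffer_used / cruvero_audit_buffer_capacity > 0.80
  for: 5m
  labels:
    severity: warning
    team: platform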

LLM Failover Churn

What it monitors: Frequent provider failovers indicating instability in the LLM provider chain.

Default threshold: More than CRUVERO_LLM_FAILOVER_THRESHOLD (default 3) failovers within the recovery interval.

Recommended response: Check provider status pages. Review failover chain order (CRUVERO_LLM_FAILOVER_CHAIN). Consider adjusting latency threshold or recovery interval. See Model Rotation.
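
A rule-body sketch, assuming a failover counter is exported (the counter name and the 15-minute window are illustrative; align the window with your recovery interval):

- alert: CruveroLLMFailoverChurn
  # counter name is an assumption; the > 3 mirrors the CRUVERO_LLM_FAILOVER_THRESHOLD default
  expr: increase(cruvero_llm_failover_total[15m]) > 3
  labels:
    severity: warning
    team: platform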

Security Signal Spikes

What it monitors: Elevated rates of security events:

  • network_policy_denied — Network policy blocking outbound requests
  • output_filter_blocked — Output filter redacting or blocking responses
  • injection_detected_total — Input sanitization detecting injection attempts

Default threshold: Configurable delta thresholds in immune alert settings.

Recommended response: Investigate the source tenant and workflow. Check for prompt injection patterns or misconfigured tools. See Security Incident.
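
For example, a spike rule over the injection counter could look like the sketch below. The per-tenant grouping, the 5-minute window, and the threshold of 10 are illustrative; take the actual delta thresholds from your immune alert settings and traffic baseline:

- alert: CruveroInjectionSpike
  # grouping label and threshold are assumptions; tune to your normal traffic
  expr: sum by (tenant) (increase(injection_detected_total[5m])) > 10
  labels:
    severity: critical
    team: platform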

Applying Alert Rules

Adjust label selectors and metric names to match your environment, then apply:

kubectl apply -f deploy/monitoring/prometheus-rules.yaml
kubectl apply -f deploy/monitoring/loki-alert-rules.yaml
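
If you run the Prometheus Operator, metric rules are typically delivered as a PrometheusRule resource so the operator picks them up; the shipped file may already be in this form. A minimal wrapper sketch follows, where the namespace and the release label are assumptions that must match your installation's ruleSelector:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cruvero-alerts
  namespace: monitoring        # example namespace
  labels:
    release: prometheus        # must match your Prometheus ruleSelector labels
spec:
  groups:
    - name: cruvero
      rules: []                # rule entries from prometheus-rules.yaml go here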

Customizing Thresholds

Alert thresholds can be customized by editing the rule YAML files directly. Key values to tune (an annotated sketch follows this list):

  • Probe intervals — Adjust the for: duration (how long the condition must hold before firing) in health probe rules
  • Quota thresholds — Modify percentage thresholds in quota rules to match your SLAs
  • Failover sensitivity — Adjust the failover count threshold based on your provider reliability
  • Security baselines — Set delta thresholds based on your normal traffic patterns
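
For example, most rules expose two knobs: the numeric threshold inside expr and the for: duration. An annotated sketch (metric name illustrative):

- alert: CruveroQuotaCritical
  expr: cruvero_quota_usage_ratio > 0.95   # percentage threshold: align with your SLA
  for: 5m                                  # how long the condition must hold before firing
  labels:
    severity: critical
    team: platform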

Adding New Alerts

To add a custom alert (a complete example follows these steps):

  1. Add the rule to the appropriate YAML file (prometheus-rules.yaml for metrics, loki-alert-rules.yaml for logs).
  2. Follow the existing label conventions (severity: warning|critical, team: platform).
  3. Include an annotations.runbook_url pointing to the relevant runbook.
  4. Test the rule with promtool check rules prometheus-rules.yaml.
  5. Apply and verify in a staging environment before production.
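
A sketch of a complete custom rule following these conventions; the metric name, threshold, and runbook URL are placeholders:

groups:
  - name: cruvero-custom
    rules:
      - alert: CruveroAuditWriteErrors
        # hypothetical counter; substitute a metric your deployment actually exports
        expr: increase(cruvero_audit_write_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Audit writes are failing on {{ $labels.instance }}"
          runbook_url: https://runbooks.example.com/cruvero/audit-writes   # placeholder URL

Note that promtool check rules validates YAML structure and PromQL syntax only; it does not confirm the metric exists, so check that the expression returns data in Prometheus before relying on the alert.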

Notes

  • Prometheus rules assume the referenced counters and gauges are exported and scraped; a rule whose metric is absent will simply never fire.
  • Loki rules work from existing structured log lines and can be used immediately where logs are centralized.
  • All alerts should route through Alertmanager with appropriate silencing and escalation policies; a minimal routing sketch follows.
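
For reference, a minimal Alertmanager routing sketch keyed off the severity label convention used in these rules. Receiver names and integrations are placeholders, and the matchers syntax assumes Alertmanager 0.22 or newer:

route:
  receiver: platform-default
  group_by: [alertname, severity]
  routes:
    - matchers:
        - severity = critical
      receiver: platform-pager
receivers:
  - name: platform-default     # e.g. Slack or email integration config goes here
  - name: platform-pager       # e.g. paging integration config goes here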