Immune Response Runbook
Procedures for operating the immune system during tool anomaly spikes, including detection, quarantine, vaccination, and safe release of affected tools.
Purpose
Operate the Phase 10D immune system safely during repeated tool failures:
- detect anomaly spikes
- vaccinate known failures
- quarantine risky tools
- release quarantines after verification
Triggers
- Rising immune_anomaly_recorded alerts.
- Repeated immune_quarantine_blocked events.
- Auto-quarantine events for production-critical tools.
- New unresolved anomalies accumulating in the immune console.
Detection
- Open the immune console: /immune.html.
- Filter by tenant and inspect:
  - unresolved anomalies (hash, tool, error type, hit count)
  - active quarantines and reasons
- Confirm alert trend in logs:
    immune alert metric=immune_anomaly_recorded ...
    immune alert metric=immune_auto_quarantine ...
Vaccination Procedure
- List unresolved anomalies:
    go run ./cmd/vaccinate --list --tenant <tenant_id>
- Pick the top hit-count signature and author a concrete fix procedure.
- Apply vaccination:
    go run ./cmd/vaccinate \
      --tenant <tenant_id> \
      --signature-hash <hash> \
      --procedure "Create branch before opening PR" \
      --resolved-by <operator>
- Verify:
  - signature now has resolution
  - procedural memory exists as immune:<hash>
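Choosing the signature to vaccinate first is a sort by hit count. The sketch below illustrates that selection and the `immune:<hash>` key to check afterwards. The `Anomaly` struct and both helper functions are hypothetical; they mirror the fields listed in the immune console (hash, tool, error type, hit count) rather than any type in the codebase, and the sample data is made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Anomaly mirrors the fields the immune console shows for an unresolved
// anomaly. Hypothetical type, for illustration only.
type Anomaly struct {
	Hash      string
	Tool      string
	ErrorType string
	HitCount  int
}

// topByHitCount returns the anomaly with the highest hit count.
// Ties are broken by hash so the choice is deterministic.
func topByHitCount(anomalies []Anomaly) (Anomaly, bool) {
	if len(anomalies) == 0 {
		return Anomaly{}, false
	}
	sorted := append([]Anomaly(nil), anomalies...) // do not mutate the input
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].HitCount != sorted[j].HitCount {
			return sorted[i].HitCount > sorted[j].HitCount
		}
		return sorted[i].Hash < sorted[j].Hash
	})
	return sorted[0], true
}

// memoryKey builds the procedural memory key to verify after vaccination.
func memoryKey(hash string) string {
	return "immune:" + hash
}

func main() {
	// Made-up sample data.
	anomalies := []Anomaly{
		{Hash: "a1", Tool: "git_push", ErrorType: "rejected", HitCount: 3},
		{Hash: "b2", Tool: "open_pr", ErrorType: "no_branch", HitCount: 9},
	}
	if top, ok := topByHitCount(anomalies); ok {
		fmt.Println(top.Hash, top.HitCount, memoryKey(top.Hash))
	}
}
```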
Quarantine Release Procedure
- Validate the fix in a safe run/replay.
- Release quarantine from UI (/immune.html) or CLI:
    go run ./cmd/vaccinate \
      --tenant <tenant_id> \
      --release <tool_name> \
      --resolved-by <operator> \
      --reason "fix validated in replay run <id>"
- Re-run workload and confirm no recurrence.
Snapshot & Retention Governance
- Worker exports resolved anomalies before retention cleanup when enabled:
    CRUVERO_IMMUNE_SNAPSHOT_ENABLED=true
    CRUVERO_IMMUNE_SNAPSHOT_DIR=backups/immune
    CRUVERO_IMMUNE_SNAPSHOT_BATCH=1000
- Verify that snapshot files appear periodically in the snapshot directory.
- Ensure backup/retention policy for snapshot artifacts aligns with compliance requirements.
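A preflight check on the snapshot variables can catch a mistyped value before retention cleanup runs. The following is a minimal sketch of such a check, not the worker's actual parsing. The `SnapshotConfig` type and `loadSnapshotConfig` function are hypothetical, and the rules it enforces (positive batch size, directory required when enabled) are assumptions; the worker may apply defaults that this runbook does not describe.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// SnapshotConfig is a hypothetical holder for the three snapshot variables.
type SnapshotConfig struct {
	Enabled bool
	Dir     string
	Batch   int
}

// loadSnapshotConfig reads the variables through getenv so callers can pass
// os.Getenv or a fake. An unset ENABLED is treated as disabled (assumption).
func loadSnapshotConfig(getenv func(string) string) (SnapshotConfig, error) {
	var cfg SnapshotConfig
	if v := getenv("CRUVERO_IMMUNE_SNAPSHOT_ENABLED"); v != "" {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("CRUVERO_IMMUNE_SNAPSHOT_ENABLED: %w", err)
		}
		cfg.Enabled = enabled
	}
	cfg.Dir = getenv("CRUVERO_IMMUNE_SNAPSHOT_DIR")
	if v := getenv("CRUVERO_IMMUNE_SNAPSHOT_BATCH"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n <= 0 {
			return cfg, fmt.Errorf("CRUVERO_IMMUNE_SNAPSHOT_BATCH: want a positive integer, got %q", v)
		}
		cfg.Batch = n
	}
	// Assumed rule: an enabled export needs an explicit target directory.
	if cfg.Enabled && cfg.Dir == "" {
		return cfg, fmt.Errorf("CRUVERO_IMMUNE_SNAPSHOT_DIR must be set when snapshots are enabled")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadSnapshotConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid snapshot config:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```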
Escalation
Escalate to incident response when any condition holds:
- auto-quarantine impacts critical tooling
- anomaly growth exceeds threshold for two consecutive windows
- vaccine procedure does not reduce hit rate within expected observation window
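The second escalation condition can be checked mechanically once "growth" and the window size are agreed on. The sketch below assumes growth means the increase in anomaly count from one window to the next, and that the threshold is an absolute count; this runbook defines neither, so both are assumptions to adjust. The function name `growthExceededTwice` is hypothetical.

```go
package main

import "fmt"

// growthExceededTwice reports whether the window-over-window increase in
// anomaly count exceeded threshold in two consecutive windows.
// counts[i] is the number of anomalies recorded in window i, oldest first.
func growthExceededTwice(counts []int, threshold int) bool {
	streak := 0
	for i := 1; i < len(counts); i++ {
		if counts[i]-counts[i-1] > threshold {
			streak++
			if streak == 2 {
				return true
			}
		} else {
			streak = 0 // a calm window resets the run
		}
	}
	return false
}

func main() {
	counts := []int{4, 5, 12, 21} // made-up sample data
	fmt.Println(growthExceededTwice(counts, 5))
}
```

With the sample data the increases are 1, 7, and 9, so a threshold of 5 is exceeded in the last two windows and the check reports true.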
Post-Incident Checklist
- Document root cause and final procedure text.
- Confirm quarantine state and release metadata.
- Confirm snapshot export for deleted resolved anomalies.
- Update relevant phase/manual docs if policy/env defaults changed.