Context Management

The enterprise context management system optimizes how Cruvero builds LLM prompts. It adds accurate BPE tokenization, provider-level prompt caching, observation masking, tool schema compression, multi-turn conversation, rolling summaries, per-tenant budget policies, proactive compression, a composable context pipeline, OTel monitoring, waste detection, and serialization optimization.

Source: internal/agent/tokenizer.go, internal/agent/context_assembler.go, internal/agent/context_budget.go, internal/agent/context_pipeline.go, internal/agent/conversation.go, internal/registry/schema_compressor.go

Context Pipeline

The context assembly pipeline runs as a sequence of stages, each independently testable:

ContextPipeline.Execute(state)
├─ stageDetectPhase → planning|executing|reviewing
├─ stageAllocateBudget → per-section token budgets (with tenant overrides)
├─ stageMaskObservations → replace old observations with one-line refs
├─ stageCompressSchemas → minify/truncate/aggressive schema compression
├─ stageBuildConversation → sliding window multi-turn (if enabled)
├─ stageAssembleContext → deterministic section assembly
└─ stageProactiveCompression → utilization check + escalating compression
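
The stage names above come from internal/agent/context_pipeline.go; the sketch below shows one way such a fail-fast stage sequence can be composed. The types and signatures are illustrative, not the exact internal API:

```go
package agent

import "fmt"

// AgentState is a stub of the state threaded through the stages; the real
// struct carries observations, schemas, budgets, and conversation history.
type AgentState struct {
	Phase string
}

// Stage is one pipeline step. Name and signature are illustrative.
type Stage func(*AgentState) error

// ContextPipeline executes its stages in order and fails fast, which keeps
// each stage independently testable.
type ContextPipeline struct {
	stages []Stage
}

func (p *ContextPipeline) Execute(s *AgentState) error {
	for _, stage := range p.stages {
		if err := stage(s); err != nil {
			return fmt.Errorf("context pipeline: %w", err)
		}
	}
	return nil
}
```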

Tokenization

Two modes are available:

  • BPE (default): Uses tiktoken-go (MIT, pure Go) for accurate token counting with model-specific encodings (cl100k_base for Claude/GPT-4, o200k_base for GPT-4o). Target accuracy: ±2%.
  • Heuristic (fallback): The original chars-per-token estimate: ±15-20% error, but zero external dependencies.

The Tokenizer interface (CountTokens(text string) int) is used throughout the agent package. The active implementation is resolved once per model via resolveTokenizer().
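
A minimal sketch of this split, assuming the pkoukk/tiktoken-go import path and a ~4-chars-per-token heuristic; the real resolveTokenizer lives in internal/agent/tokenizer.go and may differ:

```go
package agent

import (
	"strings"

	tiktoken "github.com/pkoukk/tiktoken-go" // assumed import path for tiktoken-go
)

// Tokenizer is the interface used throughout the agent package.
type Tokenizer interface {
	CountTokens(text string) int
}

// bpeTokenizer wraps a tiktoken encoding for accurate counting.
type bpeTokenizer struct{ enc *tiktoken.Tiktoken }

func (t bpeTokenizer) CountTokens(text string) int {
	return len(t.enc.Encode(text, nil, nil))
}

// heuristicTokenizer is the zero-dependency fallback (±15-20% error).
type heuristicTokenizer struct{}

func (heuristicTokenizer) CountTokens(text string) int {
	return len(text) / 4 // assumes ~4 characters per token on average
}

// resolveTokenizer picks the encoding once per model; the model-matching
// rule shown here is a simplification.
func resolveTokenizer(model string) Tokenizer {
	encoding := "cl100k_base" // Claude / GPT-4 family
	if strings.Contains(model, "gpt-4o") {
		encoding = "o200k_base"
	}
	enc, err := tiktoken.GetEncoding(encoding)
	if err != nil {
		return heuristicTokenizer{} // fall back to heuristic mode
	}
	return bpeTokenizer{enc: enc}
}
```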

Prompt Caching

Provider-level prompt caching reduces input costs by reusing cached system/tool blocks across requests.

  • Anthropic: Explicit cache_control markers on system and tool content blocks. Requires stable prefix ordering (already in place).
  • OpenAI: Automatic caching based on content hashing. Cruvero reads cached_tokens from the usage response.
  • Google Gemini: Uses the explicit CachedContent API with a cache manager for TTL-based invalidation.
  • OpenRouter: Passes cache_control hints for Anthropic-backed models; no-ops for others.

Cache metrics are tracked via Usage.CacheReadTokens and Usage.CacheWriteTokens.
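
For the Anthropic case, the request shape looks roughly like the sketch below. The cache_control field and its "ephemeral" type follow Anthropic's public Messages API; the Go types and the markCacheable helper are illustrative:

```go
package agent

// CacheControl marks a content block as cacheable; "ephemeral" is the type
// Anthropic's Messages API accepts.
type CacheControl struct {
	Type string `json:"type"`
}

// SystemBlock is a system content block. Placing cache_control on the last
// stable block asks the provider to cache the whole prefix up to it, which
// is why stable prefix ordering matters.
type SystemBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text"`
	CacheControl *CacheControl `json:"cache_control,omitempty"`
}

// markCacheable is an illustrative helper, not Cruvero's internal API.
func markCacheable(blocks []SystemBlock) []SystemBlock {
	if n := len(blocks); n > 0 {
		blocks[n-1].CacheControl = &CacheControl{Type: "ephemeral"}
	}
	return blocks
}
```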

Observation Masking

After a tool result has been consumed by the LLM (used in a decision step), the full observation is replaced with a one-line reference: [Tool: {name} — {status}, {n} tokens, step {i}]. Only the most recent N observations (configurable via CRUVERO_OBSERVATION_MASK_WINDOW) are kept in full.

This reduces context waste from stale tool outputs, saving 30-60% on steps with old observations.
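
A sketch of the masking pass, with an illustrative Observation type; only the one-line reference format comes from the source:

```go
package agent

import "fmt"

// Observation is an illustrative stand-in for a recorded tool result.
type Observation struct {
	Tool    string
	Status  string
	Tokens  int
	Step    int
	Content string
}

// maskObservations keeps the newest `window` observations in full and
// replaces everything older with the one-line reference shown above.
func maskObservations(obs []Observation, window int) {
	if len(obs) <= window {
		return
	}
	for i := range obs[:len(obs)-window] {
		o := &obs[i]
		o.Content = fmt.Sprintf("[Tool: %s — %s, %d tokens, step %d]",
			o.Tool, o.Status, o.Tokens, o.Step)
	}
}
```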

Tool Schema Compression

Tool schemas are compressed before prompt assembly at four levels:

| Level | Strategy | Savings |
| --- | --- | --- |
| `none` | Pass-through | 0% |
| `minify` | Remove whitespace, compact JSON | 10-20% |
| `truncate` | Remove descriptions, keep type + required | 30-50% |
| `aggressive` | One-line key=value format, drop optional fields | 50-70% |

Compression is applied per-tool during the stageCompressSchemas pipeline stage.
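
The sketch below illustrates the two middle levels, assuming JSON Schema input. Minification uses the standard library's json.Compact; the truncate transform shown is an assumed implementation, and the real compressor in internal/registry/schema_compressor.go may differ:

```go
package registry

import (
	"bytes"
	"encoding/json"
)

// minifySchema implements the "minify" level: compact the JSON without
// changing its meaning.
func minifySchema(schema []byte) []byte {
	var buf bytes.Buffer
	if err := json.Compact(&buf, schema); err != nil {
		return schema // leave invalid JSON untouched
	}
	return buf.Bytes()
}

// truncateSchema sketches the "truncate" level: drop description strings
// recursively while keeping type and required information.
func truncateSchema(node map[string]any) {
	delete(node, "description")
	for _, v := range node {
		if child, ok := v.(map[string]any); ok {
			truncateSchema(child)
		}
	}
}
```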

Multi-Turn Conversation

When enabled, the conversation builder maintains a sliding window of prior assistant/user turns in AgentState.ConversationHistory. This gives the LLM access to its own prior reasoning, improving coherence across multi-step workflows.

The window size is configurable. Older turns are dropped from the window but preserved in episodic memory.
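
A minimal sketch of the sliding window, with an illustrative Turn type and appendTurn helper:

```go
package agent

// Turn is one prior exchange kept in AgentState.ConversationHistory
// (field names here are illustrative).
type Turn struct {
	Role    string // "user" or "assistant"
	Content string
}

// appendTurn grows the history and trims it from the front so only the most
// recent `window` turns reach the prompt; older turns are assumed to be
// persisted to episodic memory before being dropped.
func appendTurn(history []Turn, t Turn, window int) []Turn {
	history = append(history, t)
	if len(history) > window {
		history = history[len(history)-window:]
	}
	return history
}
```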

Rolling Summaries

Replaces the default one-shot summarization with an incremental rolling approach:

  • One-shot (default): Each summary replaces the previous entirely.
  • Rolling: New summary incorporates the previous summary, producing a fixed N-bullet output that preserves critical information across multiple summarization rounds.
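
A sketch of how the rolling prompt might be built; the wording and the buildRollingSummaryPrompt helper are assumptions, and only the fold-in-the-previous-summary mechanism and the bullet cap come from the source:

```go
package agent

import (
	"fmt"
	"strings"
)

// buildRollingSummaryPrompt folds the previous summary into the next
// summarization request so critical facts survive repeated rounds.
func buildRollingSummaryPrompt(prevSummary, newContent string, maxBullets int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Rewrite the running summary as at most %d bullets.\n", maxBullets)
	b.WriteString("Preserve still-relevant facts from the previous summary and fold in the new events.\n\n")
	if prevSummary != "" {
		b.WriteString("Previous summary:\n" + prevSummary + "\n\n")
	}
	b.WriteString("New events:\n" + newContent + "\n")
	return b.String()
}
```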

Per-Tenant Budget Policies

The hardcoded phase budget percentages (plan/execute/review) can be overridden per-tenant via TenantConfig.Metadata["context_policy"]:

```json
{
  "phase_overrides": {
    "executing": {
      "tools": 40,
      "semantic": 15,
      "working": 25,
      "episodic": 10,
      "procedural": 5,
      "reserved": 5
    }
  }
}
```

Zero values use defaults. Non-zero values replace the default percentage for that section in that phase.
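
A sketch of that merge rule, with an illustrative PhaseBudget type whose fields mirror the JSON keys above:

```go
package agent

// PhaseBudget holds the per-section percentages for one phase.
type PhaseBudget struct {
	Tools, Semantic, Working, Episodic, Procedural, Reserved int
}

// applyOverride merges a tenant override onto the defaults: non-zero fields
// replace the default, zero fields keep it, as described above.
func applyOverride(def, ov PhaseBudget) PhaseBudget {
	pick := func(d, o int) int {
		if o != 0 {
			return o
		}
		return d
	}
	return PhaseBudget{
		Tools:      pick(def.Tools, ov.Tools),
		Semantic:   pick(def.Semantic, ov.Semantic),
		Working:    pick(def.Working, ov.Working),
		Episodic:   pick(def.Episodic, ov.Episodic),
		Procedural: pick(def.Procedural, ov.Procedural),
		Reserved:   pick(def.Reserved, ov.Reserved),
	}
}
```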

Proactive Compression

Instead of waiting for reactive overflow truncation, proactive compression checks utilization after the first assembly pass. If utilization exceeds the threshold (default 85%), the pipeline re-assembles with escalating strategies:

  1. Schema compression (escalate level)
  2. Mask window reduction (fewer full observations)
  3. Fact deduplication (remove duplicates across sections)
  4. Episodic trimming (keep only last 3 episodes)

Each strategy is applied incrementally until utilization drops below threshold.
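
The escalation loop, sketched with an assumed compressStep type; each numbered strategy above would be one step:

```go
package agent

// compressStep applies one escalation strategy and reports the resulting
// context utilization in [0, 1].
type compressStep func() float64

// compressUntil runs strategies in order until utilization falls below the
// threshold (default 0.85) or every strategy has been applied. The loop
// shape is an assumption, not the exact internals.
func compressUntil(utilization, threshold float64, steps []compressStep) float64 {
	for _, apply := range steps {
		if utilization < threshold {
			break
		}
		utilization = apply()
	}
	return utilization
}
```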

Context Waste Detection

Tracks which tools were included in context but never called. The AssembledContext.WastedTools field lists unused tools, and WasteRatio measures the ratio of wasted to included tools. This feeds back into tool quality scoring (Phase 19) for better tool selection over time.
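
A sketch of the computation, with a hypothetical wasteStats helper:

```go
package agent

// wasteStats computes the two fields described above: the list of tools that
// were included in context but never called, and the wasted/included ratio.
func wasteStats(included []string, called map[string]bool) (wasted []string, ratio float64) {
	for _, name := range included {
		if !called[name] {
			wasted = append(wasted, name)
		}
	}
	if len(included) == 0 {
		return nil, 0
	}
	return wasted, float64(len(wasted)) / float64(len(included))
}
```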

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `CRUVERO_TOKENIZER_MODE` | `bpe` | Token counting: `bpe` or `heuristic` |
| `CRUVERO_PROMPT_CACHE_ENABLED` | `true` | Provider-level prompt caching |
| `CRUVERO_OBSERVATION_MASK_ENABLED` | `true` | Mask consumed observations |
| `CRUVERO_OBSERVATION_MASK_WINDOW` | `2` | Recent full observations to keep |
| `CRUVERO_TOOL_SCHEMA_COMPRESSION` | `truncate` | Compression level: `none`, `minify`, `truncate`, `aggressive` |
| `CRUVERO_CONVERSATION_ENABLED` | `false` | Multi-turn conversation builder |
| `CRUVERO_CONVERSATION_WINDOW` | `5` | Max turns in sliding window |
| `CRUVERO_SUMMARY_MODE` | `oneshot` | Summary mode: `rolling` or `oneshot` |
| `CRUVERO_SUMMARY_MAX_BULLETS` | `5` | Max bullets in rolling summary |
| `CRUVERO_COMPRESSION_THRESHOLD` | `0.85` | Proactive compression utilization trigger |
| `CRUVERO_CONTEXT_WASTE_TRACKING` | `false` | Enable waste detection metrics |

All variables can be overridden per-tenant via TenantConfig.Metadata, keyed by the variable name lowercased and without the CRUVERO_ prefix (e.g. tokenizer_mode for CRUVERO_TOKENIZER_MODE).
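
A sketch of that lookup, with a hypothetical resolveSetting helper; the precedence order shown (tenant metadata, then environment, then default) is an assumption:

```go
package agent

import (
	"os"
	"strings"
)

// resolveSetting looks up a CRUVERO_* variable with a per-tenant override:
// the metadata key is the variable name, lowercased, without the prefix.
func resolveSetting(metadata map[string]string, envName, def string) string {
	key := strings.ToLower(strings.TrimPrefix(envName, "CRUVERO_"))
	if v, ok := metadata[key]; ok && v != "" {
		return v // tenant override wins
	}
	if v := os.Getenv(envName); v != "" {
		return v
	}
	return def
}
```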