Source: docs/manual/prompt-library-v2.md
This page is generated by site/scripts/sync-manual-docs.mjs.
Prompt Library v2
Advanced prompt management extensions for deployment environments, composable snippets, A/B experimentation, structured evaluation, version diffing, CI/CD integration, and prompt analytics.
Source: internal/promptlib/*
Overview
Phase 26 extends the Prompt Library with lifecycle management features that bridge the gap between "prompts exist in a catalog" and "prompts are safely managed across their lifecycle in production." All extensions are backward-compatible — when disabled, the system behaves identically to the base Phase 18 prompt library.
Deployment Environments
Prompts are promoted through named environments (default: dev → staging → production). Each promotion is an assignment — the immutable prompt version is linked to the environment, not copied.
```
prompt created → dev → staging → production
                  ↑        ↑          ↑
      quality gates enforce thresholds at each transition
```
- `EnvironmentStore.Promote` upserts the assignment and appends to promotion history
- `EnvironmentStore.GetActive` resolves the current prompt version for an environment
- Searcher filters results by environment when `SearchQuery.Environment` is set
- When environments are disabled, all prompts are visible (Phase 18 behavior)
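A minimal sketch of the promotion flow, assuming hypothetical `EnvironmentStore` method signatures and an `Assignment` shape (the actual internal/promptlib types may differ):

```go
// Hypothetical shapes for the environment assignment flow; names and
// signatures are assumptions, not the actual internal/promptlib API.
package promotion

import (
	"context"
	"time"
)

// Assignment links an immutable prompt version to an environment.
type Assignment struct {
	PromptHash  string
	Environment string
	PromotedBy  string
	PromotedAt  time.Time
}

type EnvironmentStore interface {
	// Promote upserts the assignment and appends a promotion-history row.
	Promote(ctx context.Context, a Assignment) error
	// GetActive resolves the prompt version currently assigned to env.
	GetActive(ctx context.Context, promptID, env string) (Assignment, error)
}

// PromoteToStaging assigns a prompt version to the staging environment.
func PromoteToStaging(ctx context.Context, store EnvironmentStore, hash, user string) error {
	return store.Promote(ctx, Assignment{
		PromptHash:  hash,
		Environment: "staging",
		PromotedBy:  user,
		PromotedAt:  time.Now(),
	})
}
```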
Quality Gates
Promotion can be gated on quality thresholds:
| Gate | Description |
|---|---|
| `MinUsageCount` | Minimum number of agent invocations before promotion |
| `MinSuccessRate` | Minimum success rate (0.0–1.0) from `prompt_metrics` |
| `MinAvgRating` | Minimum average LLM rating (0.0–1.0) |
| `RequireEvalPass` | Require a passing eval run before promotion |
All conditions must pass. Each failure includes a human-readable reason (e.g., "success_rate 0.72 < minimum 0.80").
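A sketch of how the gate check could look; only the four gate names and the reason format come from this section, the field and function names are assumptions:

```go
// Hypothetical gate check; everything beyond the documented gate names
// and the human-readable reason format is an assumption.
package gates

import "fmt"

type QualityGates struct {
	MinUsageCount   int
	MinSuccessRate  float64
	MinAvgRating    float64
	RequireEvalPass bool
}

type PromptMetrics struct {
	UsageCount  int
	SuccessRate float64
	AvgRating   float64
	EvalPassed  bool
}

// Check returns every failed gate as a human-readable reason; promotion is
// allowed only when the returned slice is empty.
func Check(g QualityGates, m PromptMetrics) []string {
	var reasons []string
	if m.UsageCount < g.MinUsageCount {
		reasons = append(reasons, fmt.Sprintf("usage_count %d < minimum %d", m.UsageCount, g.MinUsageCount))
	}
	if m.SuccessRate < g.MinSuccessRate {
		reasons = append(reasons, fmt.Sprintf("success_rate %.2f < minimum %.2f", m.SuccessRate, g.MinSuccessRate))
	}
	if m.AvgRating < g.MinAvgRating {
		reasons = append(reasons, fmt.Sprintf("avg_rating %.2f < minimum %.2f", m.AvgRating, g.MinAvgRating))
	}
	if g.RequireEvalPass && !m.EvalPassed {
		reasons = append(reasons, "no passing eval run found")
	}
	return reasons
}
```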
Composable Snippets
Prompts can reference other prompts as composable fragments using Go template syntax:
{{snippet "safety-guardrails"}}
{{snippet "output-format" "v3"}}
{{snippet "preamble" "production"}}
- First argument: snippet prompt ID
- Optional second argument: version number or environment label
- Resolution: by version → by environment label → latest
- Cycle detection prevents infinite loops (max depth configurable, default 3)
- Snippet dependencies tracked in the `prompt_snippet_refs` table
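A sketch of depth-limited snippet expansion with Go's text/template; `SnippetStore` and its `Resolve` method are hypothetical, and the depth counter stands in for the cycle detection described above:

```go
// Sketch of snippet expansion; SnippetStore and Render are assumptions
// about how internal/promptlib might wire this up.
package snippets

import (
	"bytes"
	"fmt"
	"text/template"
)

// SnippetStore resolves a snippet body by ID and optional version/env selector.
type SnippetStore interface {
	Resolve(id, selector string) (string, error)
}

// Render expands {{snippet "id" "selector"}} references, recursing into
// nested snippets up to maxDepth levels to prevent infinite loops.
func Render(body string, store SnippetStore, maxDepth int) (string, error) {
	if maxDepth < 0 {
		return "", fmt.Errorf("snippet nesting exceeds max depth")
	}
	funcs := template.FuncMap{
		"snippet": func(id string, selector ...string) (string, error) {
			sel := "" // empty selector means "latest"
			if len(selector) > 0 {
				sel = selector[0]
			}
			raw, err := store.Resolve(id, sel)
			if err != nil {
				return "", err
			}
			// Recurse so nested snippet references are expanded too.
			return Render(raw, store, maxDepth-1)
		},
	}
	tmpl, err := template.New("prompt").Funcs(funcs).Parse(body)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, nil); err != nil {
		return "", err
	}
	return out.String(), nil
}
```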
A/B Experiments
Controlled prompt A/B testing using Temporal SideEffect for replay-safe variant selection:
- Traffic split by percentage (variants must sum to 100%)
- Variant selection is deterministic on replay (seeded from RunID + PromptID)
- Outcomes recorded fire-and-forget (non-blocking)
- Auto-completion when sample size threshold reached (promotes winner)
- Only one active experiment per prompt at a time
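A sketch of replay-safe variant selection; the `Variant` type and hashing scheme are illustrative, but the SideEffect wrapping and RunID + PromptID seeding follow the description above:

```go
// Illustrative variant selection; Variant and selectVariant are not the
// actual internal/promptlib types.
package experiment

import (
	"hash/fnv"

	"go.temporal.io/sdk/workflow"
)

type Variant struct {
	PromptHash string
	Percent    int // traffic share; all variants must sum to 100
}

// selectVariant hashes RunID+PromptID inside a SideEffect so the choice is
// recorded in workflow history and reused verbatim on replay.
func selectVariant(ctx workflow.Context, promptID string, variants []Variant) (string, error) {
	var chosen string
	err := workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
		runID := workflow.GetInfo(ctx).WorkflowExecution.RunID
		h := fnv.New32a()
		h.Write([]byte(runID + promptID))
		bucket := int(h.Sum32() % 100)
		acc := 0
		for _, v := range variants {
			acc += v.Percent
			if bucket < acc {
				return v.PromptHash
			}
		}
		return variants[len(variants)-1].PromptHash
	}).Get(&chosen)
	return chosen, err
}
```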
Evaluation Framework
Structured evaluation with datasets, scorers, and a Temporal workflow orchestrator:
Datasets
Versioned collections of input/expected-output pairs. Created from JSON, YAML, CSV, or production logs. Copy-on-write versioning ensures immutable dataset snapshots.
Built-in Scorers
| Scorer | Description |
|---|---|
| `exact_match` | Binary 1.0/0.0 on exact string match |
| `contains` | 1.0 if all required substrings found |
| `regex` | 1.0 if output matches pattern |
| `cosine_similarity` | Embedding-based similarity (0.0–1.0) |
| `llm_judge` | LLM-as-a-judge quality rating (0.0–1.0) |
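A hypothetical `Scorer` interface with two of the built-in behaviors, to illustrate the scoring contract (the real implementations may be shaped differently):

```go
// Hypothetical Scorer interface; the scoring semantics match the table above.
package scorers

import "strings"

type Scorer interface {
	// Score returns a value in [0.0, 1.0] for a single eval entry.
	Score(output, expected string) float64
}

// ExactMatch is binary: 1.0 only when output equals expected exactly.
type ExactMatch struct{}

func (ExactMatch) Score(output, expected string) float64 {
	if output == expected {
		return 1.0
	}
	return 0.0
}

// Contains scores 1.0 only when every required substring is present.
type Contains struct{ Required []string }

func (c Contains) Score(output, _ string) float64 {
	for _, s := range c.Required {
		if !strings.Contains(output, s) {
			return 0.0
		}
	}
	return 1.0
}
```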
EvalRunWorkflow
Temporal workflow that processes dataset entries with configurable concurrency:
- Load prompt and dataset
- For each entry: render template → call LLM → run scorers → store result
- Aggregate into `EvalSummary` (per-scorer averages, pass rate, latency, cost)
- Individual entry failures do not abort the run
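A simplified, sequential sketch of the run loop; `EvalEntry`, `EvalResult`, and the `EvaluateEntry` activity name are assumptions, and the real workflow adds configurable concurrency and cost tracking:

```go
// Simplified eval run loop; types and activity name are assumptions.
package eval

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

type EvalEntry struct{ Input, Expected string }

type EvalResult struct {
	Scores map[string]float64
	Passed bool
}

type EvalSummary struct {
	PassRate   float64
	AvgScores  map[string]float64
	FailedRuns int
}

func EvalRunWorkflow(ctx workflow.Context, entries []EvalEntry) (EvalSummary, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute, // per-entry timeout
	})

	summary := EvalSummary{AvgScores: map[string]float64{}}
	passed := 0
	for _, entry := range entries {
		var res EvalResult
		// render template → call LLM → run scorers happens inside the activity.
		if err := workflow.ExecuteActivity(ctx, "EvaluateEntry", entry).Get(ctx, &res); err != nil {
			summary.FailedRuns++ // individual failures do not abort the run
			continue
		}
		if res.Passed {
			passed++
		}
		for name, s := range res.Scores {
			summary.AvgScores[name] += s
		}
	}
	if n := len(entries); n > 0 {
		summary.PassRate = float64(passed) / float64(n)
		for name := range summary.AvgScores {
			summary.AvgScores[name] /= float64(n)
		}
	}
	return summary, nil
}
```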
Version Diff
Line-level diff between prompt versions with metadata comparison:
- Myers diff algorithm produces add/delete/modify/equal hunks
- Context lines configurable (default 3)
- Summary-only mode for prompts > 10KB
- Metadata diff detects parameter, tag, and type changes
- Available via API (`/api/prompts/{id}/diff`) and CLI (`prompt-diff`)
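Hypothetical shapes for the diff output, to make the hunk, summary-only, and metadata pieces concrete (actual field names may differ):

```go
// Illustrative diff output types; a Myers diff over prompt lines produces
// hunks roughly like these.
package promptdiff

type DiffOp string

const (
	OpAdd    DiffOp = "add"
	OpDelete DiffOp = "delete"
	OpModify DiffOp = "modify"
	OpEqual  DiffOp = "equal"
)

type DiffHunk struct {
	Op       DiffOp
	OldStart int      // 1-based line number in the older version
	NewStart int      // 1-based line number in the newer version
	Lines    []string // changed lines plus surrounding context lines
}

type DiffResult struct {
	Hunks        []DiffHunk
	SummaryOnly  bool                 // set for prompts larger than 10KB
	MetadataDiff map[string][2]string // field → {old, new} for params, tags, type
}
```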
CI/CD Integration
The prompt-eval CLI is designed for CI/CD pipelines:
- `--ci` flag for machine-readable JSON output
- `--github-summary` for GitHub Actions job summary markdown
- `--regression-baseline auto` to find the most recent passing baseline
- `--format markdown` for PR comment comparison tables
- Exit code 0 on pass, 1 on fail (CI/CD compatible)
Example GitHub Actions usage:
```sh
prompt-eval \
  --prompt-hash "$HASH" \
  --dataset "$DATASET_ID" \
  --scorers "exact_match,llm_judge" \
  --threshold 0.80 \
  --fail-on-regression \
  --ci --github-summary
```
Production Log → Dataset Pipeline
Converts production agent runs into eval datasets via the audit log:
- Successful runs become entries with `expected_output` = actual output
- Failed runs become entries flagged for human review
- Configurable time range, entry limits, and failure-only filtering
- Creates regression test suites from real production data
```sh
prompt-dataset --from-logs \
  --prompt-hash abc123 \
  --since 168h \
  --max-entries 500 \
  --failures-only
```
NATS Cache Invalidation
When a prompt is promoted, a NATS event (prompt.promoted) triggers immediate cache invalidation across all agents. Without NATS, agents fall back to TTL-based expiry (default 5 minutes).
- Publisher: best-effort (failure does not roll back promotion)
- Subscriber: read-through cache with TTL fallback
- Subject: configurable (default `cruvero.prompts.events`)
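A minimal subscriber sketch with nats.go; the `Cache` interface and event payload handling are assumptions, only the promotion event and TTL fallback come from this section:

```go
// Illustrative cache-invalidation subscriber; payload format is assumed.
package promptcache

import (
	"time"

	"github.com/nats-io/nats.go"
)

type Cache interface {
	Invalidate(promptID string)
}

// Subscribe invalidates the local prompt cache whenever a promotion event
// arrives; if NATS is unavailable the cache simply expires entries by TTL.
func Subscribe(natsURL, subject string, cache Cache, ttl time.Duration) error {
	nc, err := nats.Connect(natsURL)
	if err != nil {
		return err // caller falls back to TTL-only expiry (default 5 minutes)
	}
	_, err = nc.Subscribe(subject, func(msg *nats.Msg) {
		// Assumed payload: the promoted prompt's ID as the message body.
		cache.Invalidate(string(msg.Data))
	})
	return err
}
```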
Provider Blueprints
Provider-agnostic intermediate representation between prompt content and LLM API calls:
| Adapter | Format |
|---|---|
| `OpenAIAdapter` | OpenAI chat completion format |
| `AnthropicAdapter` | Anthropic messages format (system extracted) |
| `AzureAdapter` | Azure OpenAI with deployment name mapping |
`RenderToBlueprint` maps `PromptType` to message roles (system → system message, user → user message, task → system + user pair, chain_of_thought → system with CoT + user).
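A sketch of the blueprint-to-provider mapping for the Anthropic case, where the system prompt is pulled out of the message list; all type names here are illustrative:

```go
// Illustrative blueprint shapes; the real adapters may differ.
package blueprint

type Message struct {
	Role    string // "system", "user", or "assistant"
	Content string
}

// Blueprint is the provider-agnostic intermediate representation.
type Blueprint struct {
	Messages []Message
}

// AnthropicRequest mirrors the Anthropic messages format, where the system
// prompt is a top-level field rather than a message.
type AnthropicRequest struct {
	System   string    `json:"system,omitempty"`
	Messages []Message `json:"messages"`
}

func ToAnthropic(b Blueprint) AnthropicRequest {
	var req AnthropicRequest
	for _, m := range b.Messages {
		if m.Role == "system" {
			if req.System != "" {
				req.System += "\n\n"
			}
			req.System += m.Content // extract system content to the top level
			continue
		}
		req.Messages = append(req.Messages, m)
	}
	return req
}
```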
Prompt Analytics
Time-series queries over prompt usage and quality metrics:
- `GetTimeSeries`: usage_count, success_rate, avg_rating, failure_count bucketed by hour/day/week
- `GetTopPrompts`: prompts ranked by metric over time range
- `GetPromptComparison`: side-by-side metrics for multiple prompts
API endpoints:
- `GET /api/prompts/{hash}/analytics?metric=usage_count&interval=day`
- `GET /api/prompts/rankings?metric=success_rate&limit=10`
- `GET /api/prompts/compare?hashes=abc,def,ghi`
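A hypothetical view of the time-series query surface, assuming simple Go types around the documented metrics and intervals:

```go
// Assumed analytics query types; only the metric names, intervals, and
// query method names come from the docs above.
package analytics

import (
	"context"
	"time"
)

type Bucket struct {
	Start time.Time
	Value float64
}

type TimeSeriesQuery struct {
	PromptHash string
	Metric     string // usage_count, success_rate, avg_rating, failure_count
	Interval   string // hour, day, or week
	Since      time.Time
}

type Analytics interface {
	GetTimeSeries(ctx context.Context, q TimeSeriesQuery) ([]Bucket, error)
}
```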
Configuration
| Variable | Default | Description |
|---|---|---|
| `CRUVERO_PROMPTLIB_ENVS_ENABLED` | true | Enable deployment environments |
| `CRUVERO_PROMPTLIB_DEFAULT_ENVS` | dev,staging,production | Environment names created per tenant |
| `CRUVERO_PROMPTLIB_SNIPPETS_ENABLED` | true | Enable snippet composition |
| `CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH` | 3 | Max nested snippet depth |
| `CRUVERO_PROMPTLIB_EXPERIMENTS_ENABLED` | false | Enable A/B experimentation |
| `CRUVERO_PROMPTLIB_EXPERIMENT_MAX_VARIANTS` | 4 | Max variants per experiment |
| `CRUVERO_PROMPTLIB_EVAL_ENABLED` | true | Enable evaluation framework |
| `CRUVERO_PROMPTLIB_EVAL_TIMEOUT` | 300s | Per-entry eval timeout |
| `CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT` | 10 | Max concurrent eval entries |
| `CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES` | 3 | Lines of context in diff output |
| `CRUVERO_PROMPTLIB_NATS_CACHE_ENABLED` | false | Enable NATS cache invalidation |
| `CRUVERO_PROMPTLIB_NATS_SUBJECT` | cruvero.prompts.events | NATS subject for prompt events |
| `CRUVERO_PROMPTLIB_BLUEPRINT_ENABLED` | false | Enable provider-agnostic blueprints |
| `CRUVERO_PROMPTLIB_ANALYTICS_RETENTION` | 90d | Analytics data retention period |
CLI Tools
| CLI | Description |
|---|---|
| `prompt-eval` | Run eval against dataset, exit 0/1 for CI/CD |
| `prompt-dataset` | Create/manage eval datasets (JSON, YAML, CSV, logs) |
| `prompt-experiment` | Create/list/complete A/B experiments |
| `prompt-diff` | Diff prompt versions with colored output |
Related Docs
- Prompt Library — Base prompt library (Phase 18)
- Memory System — Salience scoring patterns reused by search ranking
- Tools and Registry — Tool interface pattern used by `prompt_promote`
- Configuration and Environment — Env var conventions