
Prompt Library v2

Advanced prompt management extensions for deployment environments, composable snippets, A/B experimentation, structured evaluation, version diffing, CI/CD integration, and prompt analytics.

Source: internal/promptlib/*

Overview

Phase 26 extends the Prompt Library with lifecycle management features that bridge the gap between "prompts exist in a catalog" and "prompts are safely managed across their lifecycle in production." All extensions are backward-compatible — when disabled, the system behaves identically to the base Phase 18 prompt library.

Deployment Environments

Prompts are promoted through named environments (default: dev → staging → production). Each promotion is an assignment: the immutable prompt version is linked to the environment, not copied.

prompt created → dev → staging → production
(quality gates enforce thresholds at each transition)
  • EnvironmentStore.Promote upserts the assignment and appends to promotion history
  • EnvironmentStore.GetActive resolves the current prompt version for an environment
  • Searcher filters results by environment when SearchQuery.Environment is set
  • When environments are disabled, all prompts are visible (Phase 18 behavior)
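
A minimal Go sketch of how a caller might drive a promotion. The method names `Promote` and `GetActive` come from the list above, but the `Assignment` shape and the exact signatures are illustrative assumptions, not the library's actual API.

```go
package promptlibdoc

import (
	"context"
	"fmt"
)

// Assignment links an immutable prompt version to an environment; the field
// names here are assumptions for illustration.
type Assignment struct {
	PromptID    string
	Version     int
	Environment string
}

// EnvironmentStore mirrors the operations listed above with hypothetical signatures.
type EnvironmentStore interface {
	// Promote upserts the assignment and appends to promotion history.
	Promote(ctx context.Context, a Assignment) error
	// GetActive resolves the current prompt version for an environment.
	GetActive(ctx context.Context, promptID, env string) (Assignment, error)
}

// promoteToStaging promotes a version and reads back the active assignment.
func promoteToStaging(ctx context.Context, store EnvironmentStore, promptID string, version int) error {
	a := Assignment{PromptID: promptID, Version: version, Environment: "staging"}
	if err := store.Promote(ctx, a); err != nil {
		return fmt.Errorf("promote %s v%d to staging: %w", promptID, version, err)
	}
	active, err := store.GetActive(ctx, promptID, "staging")
	if err != nil {
		return err
	}
	fmt.Printf("staging now serves %s v%d\n", active.PromptID, active.Version)
	return nil
}
```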

Quality Gates

Promotion can be gated on quality thresholds:

| Gate | Description |
| --- | --- |
| MinUsageCount | Minimum number of agent invocations before promotion |
| MinSuccessRate | Minimum success rate (0.0–1.0) from prompt_metrics |
| MinAvgRating | Minimum average LLM rating (0.0–1.0) |
| RequireEvalPass | Require a passing eval run before promotion |

All conditions must pass. Each failure includes a human-readable reason (e.g., "success_rate 0.72 < minimum 0.80").
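
A sketch of what the gate check could look like in Go, assuming a config struct shaped like the table above. The gate and metric field names mirror the table; the evaluation logic and reason strings are illustrative, modeled on the example failure message.

```go
package promptlibdoc

import "fmt"

// QualityGates mirrors the gate names in the table above.
type QualityGates struct {
	MinUsageCount   int
	MinSuccessRate  float64
	MinAvgRating    float64
	RequireEvalPass bool
}

// PromptMetrics is an assumed snapshot of the prompt_metrics data consulted by the gates.
type PromptMetrics struct {
	UsageCount  int
	SuccessRate float64
	AvgRating   float64
	EvalPassed  bool
}

// checkGates returns one human-readable reason per failed condition,
// in the "success_rate 0.72 < minimum 0.80" style described above.
// An empty result means all gates pass and promotion may proceed.
func checkGates(g QualityGates, m PromptMetrics) []string {
	var reasons []string
	if m.UsageCount < g.MinUsageCount {
		reasons = append(reasons, fmt.Sprintf("usage_count %d < minimum %d", m.UsageCount, g.MinUsageCount))
	}
	if m.SuccessRate < g.MinSuccessRate {
		reasons = append(reasons, fmt.Sprintf("success_rate %.2f < minimum %.2f", m.SuccessRate, g.MinSuccessRate))
	}
	if m.AvgRating < g.MinAvgRating {
		reasons = append(reasons, fmt.Sprintf("avg_rating %.2f < minimum %.2f", m.AvgRating, g.MinAvgRating))
	}
	if g.RequireEvalPass && !m.EvalPassed {
		reasons = append(reasons, "no passing eval run found")
	}
	return reasons
}
```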

Composable Snippets

Prompts can reference other prompts as composable fragments using Go template syntax:

{{snippet "safety-guardrails"}}
{{snippet "output-format" "v3"}}
{{snippet "preamble" "production"}}
  • First argument: snippet prompt ID
  • Optional second argument: version number or environment label
  • Resolution: by version → by environment label → latest
  • Cycle detection prevents infinite loops (max depth configurable, default 3)
  • Snippet dependencies tracked in prompt_snippet_refs table
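
Because snippets use Go template syntax, the helper can be pictured as a custom template function. Below is a minimal sketch wiring a `snippet` function into text/template; the resolver is a stub, and the real resolution order (version → environment label → latest), depth limit, and dependency tracking are not shown.

```go
package promptlibdoc

import (
	"strings"
	"text/template"
)

// resolveSnippet is a hypothetical lookup: a snippet prompt ID plus an
// optional version or environment selector, returning the snippet body.
func resolveSnippet(id string, selector ...string) (string, error) {
	// A real resolver would load the snippet by version, then environment
	// label, then latest, and enforce the configured max depth.
	return "(body of " + id + ")", nil
}

// renderPrompt parses and executes a prompt body that may contain
// {{snippet "id"}} or {{snippet "id" "v3"}} references.
func renderPrompt(promptBody string, data any) (string, error) {
	tmpl, err := template.New("prompt").
		Funcs(template.FuncMap{"snippet": resolveSnippet}).
		Parse(promptBody)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}
```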

A/B Experiments

Controlled prompt A/B testing using Temporal SideEffect for replay-safe variant selection:

  • Traffic split by percentage (variants must sum to 100%)
  • Variant selection is deterministic on replay (seeded from RunID + PromptID)
  • Outcomes recorded fire-and-forget (non-blocking)
  • Auto-completion when sample size threshold reached (promotes winner)
  • Only one active experiment per prompt at a time
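
A sketch of the replay-safe selection step using the Temporal Go SDK. Only the use of workflow.SideEffect and the RunID + PromptID seeding come from the description above; the `Variant` shape and the hashing details are assumptions.

```go
package promptlibdoc

import (
	"hash/fnv"

	"go.temporal.io/sdk/workflow"
)

// Variant is a hypothetical shape: a prompt version plus its traffic share.
type Variant struct {
	PromptHash string
	Percent    int // all variants in an experiment must sum to 100
}

// pickVariant makes the choice inside workflow.SideEffect, so the selected
// index is recorded in workflow history and is identical on replay.
func pickVariant(ctx workflow.Context, promptID string, variants []Variant) (Variant, error) {
	var idx int
	err := workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
		// Seed from RunID + PromptID so the split is stable for a given run.
		h := fnv.New32a()
		h.Write([]byte(workflow.GetInfo(ctx).WorkflowExecution.RunID + promptID))
		bucket := int(h.Sum32() % 100)
		cum := 0
		for i, v := range variants {
			cum += v.Percent
			if bucket < cum {
				return i
			}
		}
		return len(variants) - 1
	}).Get(&idx)
	if err != nil {
		return Variant{}, err
	}
	return variants[idx], nil
}
```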

Evaluation Framework

Structured evaluation with datasets, scorers, and a Temporal workflow orchestrator:

Datasets

Versioned collections of input/expected-output pairs. Created from JSON, YAML, CSV, or production logs. Copy-on-write versioning ensures immutable dataset snapshots.

Built-in Scorers

| Scorer | Description |
| --- | --- |
| exact_match | Binary 1.0/0.0 on exact string match |
| contains | 1.0 if all required substrings found |
| regex | 1.0 if output matches pattern |
| cosine_similarity | Embedding-based similarity (0.0–1.0) |
| llm_judge | LLM-as-a-judge quality rating (0.0–1.0) |
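
A sketch of the scorer plug-in point, with two of the built-ins re-expressed for illustration. The `Scorer` interface name and signature are assumptions; only the scorer names and their 0.0–1.0 semantics come from the table above.

```go
package promptlibdoc

import (
	"context"
	"regexp"
	"strings"
)

// Scorer is an assumed interface: every scorer returns a value in [0.0, 1.0].
type Scorer interface {
	Name() string
	Score(ctx context.Context, output, expected string) (float64, error)
}

// containsScorer mirrors the built-in "contains" scorer: 1.0 only if every
// required substring is present in the output.
type containsScorer struct{ required []string }

func (containsScorer) Name() string { return "contains" }

func (s containsScorer) Score(_ context.Context, output, _ string) (float64, error) {
	for _, sub := range s.required {
		if !strings.Contains(output, sub) {
			return 0.0, nil
		}
	}
	return 1.0, nil
}

// regexScorer mirrors the built-in "regex" scorer: 1.0 if the output matches the pattern.
type regexScorer struct{ pattern *regexp.Regexp }

func (regexScorer) Name() string { return "regex" }

func (s regexScorer) Score(_ context.Context, output, _ string) (float64, error) {
	if s.pattern.MatchString(output) {
		return 1.0, nil
	}
	return 0.0, nil
}
```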

EvalRunWorkflow

Temporal workflow that processes dataset entries with configurable concurrency:

  1. Load prompt and dataset
  2. For each entry: render template → call LLM → run scorers → store result
  3. Aggregate into EvalSummary (per-scorer averages, pass rate, latency, cost)
  4. Individual entry failures do not abort the run
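
The workflow body might look roughly like the sketch below. The activity names ("LoadDataset", "ScoreEntry"), the result types, and the folding of render/LLM-call/scoring into a single activity are simplifications; per-entry concurrency (CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT) is omitted for brevity.

```go
package promptlibdoc

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// EvalEntry, EvalResult, and EvalSummary are simplified stand-ins for the real types.
type EvalEntry struct{ Input, Expected string }

type EvalResult struct {
	Scores map[string]float64
	Err    string
}

type EvalSummary struct {
	Results  []EvalResult
	PassRate float64
}

// EvalRunWorkflow processes dataset entries and aggregates a summary.
func EvalRunWorkflow(ctx workflow.Context, promptHash, datasetID string) (EvalSummary, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute, // per-entry timeout (see CRUVERO_PROMPTLIB_EVAL_TIMEOUT)
	})

	// Step 1: load the dataset (the prompt load is folded into the activities here).
	var entries []EvalEntry
	if err := workflow.ExecuteActivity(ctx, "LoadDataset", datasetID).Get(ctx, &entries); err != nil {
		return EvalSummary{}, err
	}

	// Step 2: score each entry (render → LLM → scorers → store).
	// Individual entry failures are recorded, not fatal to the run.
	summary := EvalSummary{}
	passed := 0
	for _, e := range entries {
		var res EvalResult
		if err := workflow.ExecuteActivity(ctx, "ScoreEntry", promptHash, e).Get(ctx, &res); err != nil {
			res = EvalResult{Err: err.Error()}
		}
		if res.Err == "" {
			passed++
		}
		summary.Results = append(summary.Results, res)
	}

	// Step 3: aggregate into the summary.
	if len(entries) > 0 {
		summary.PassRate = float64(passed) / float64(len(entries))
	}
	return summary, nil
}
```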

Version Diff

Line-level diff between prompt versions with metadata comparison:

  • Myers diff algorithm produces add/delete/modify/equal hunks
  • Context lines configurable (default 3)
  • Summary-only mode for prompts > 10KB
  • Metadata diff detects parameter, tag, and type changes
  • Available via API (/api/prompts/{id}/diff) and CLI (prompt-diff)
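
The hunk kinds above map onto a result structure along these lines; the type and field names are assumptions about the shape returned by the API and CLI, not the package's actual types.

```go
package promptlibdoc

// HunkOp enumerates the hunk kinds produced by the Myers diff.
type HunkOp string

const (
	HunkAdd    HunkOp = "add"
	HunkDelete HunkOp = "delete"
	HunkModify HunkOp = "modify"
	HunkEqual  HunkOp = "equal"
)

// Hunk is one contiguous change, carrying the configured context lines.
type Hunk struct {
	Op       HunkOp
	OldStart int      // 1-based line number in the older version
	NewStart int      // 1-based line number in the newer version
	Lines    []string // changed lines plus surrounding context
}

// VersionDiff is the full comparison between two prompt versions.
type VersionDiff struct {
	Hunks        []Hunk
	SummaryOnly  bool     // set when the prompt exceeds 10KB
	MetadataDiff []string // parameter, tag, and type changes
}
```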

CI/CD Integration

The prompt-eval CLI is designed for CI/CD pipelines:

  • --ci flag for machine-readable JSON output
  • --github-summary for GitHub Actions job summary markdown
  • --regression-baseline auto to find the most recent passing baseline
  • --format markdown for PR comment comparison tables
  • Exit code 0 on pass, 1 on fail (CI/CD compatible)

Example GitHub Actions usage:

prompt-eval \
--prompt-hash "$HASH" \
--dataset "$DATASET_ID" \
--scorers "exact_match,llm_judge" \
--threshold 0.80 \
--fail-on-regression \
--ci --github-summary

Production Log → Dataset Pipeline

Converts production agent runs into eval datasets via the audit log:

  • Successful runs become entries with expected_output = actual output
  • Failed runs become entries flagged for human review
  • Configurable time range, entry limits, and failure-only filtering
  • Creates regression test suites from real production data
prompt-dataset --from-logs \
--prompt-hash abc123 \
--since 168h \
--max-entries 500 \
--failures-only

NATS Cache Invalidation

When a prompt is promoted, a NATS event (prompt.promoted) triggers immediate cache invalidation across all agents. Without NATS, agents fall back to TTL-based expiry (default 5 minutes).

  • Publisher: best-effort (failure does not roll back promotion)
  • Subscriber: read-through cache with TTL fallback
  • Subject: configurable (default cruvero.prompts.events)
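
A sketch of the subscriber side using nats.go: invalidate the local prompt cache when a prompt.promoted event arrives, and keep serving from the TTL cache if the payload is unreadable. The subject is the default from above; the event payload shape and cache interface are assumptions.

```go
package promptlibdoc

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// promotedEvent is an assumed payload for prompt.promoted events.
type promotedEvent struct {
	PromptID    string `json:"prompt_id"`
	Environment string `json:"environment"`
}

// promptCache is an assumed local read-through cache with TTL fallback.
type promptCache interface {
	Invalidate(promptID string)
}

// subscribePromotions invalidates cached prompts as promotion events arrive.
func subscribePromotions(nc *nats.Conn, cache promptCache) (*nats.Subscription, error) {
	return nc.Subscribe("cruvero.prompts.events", func(msg *nats.Msg) {
		var ev promotedEvent
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			log.Printf("ignoring malformed prompt event: %v", err) // TTL expiry still applies
			return
		}
		cache.Invalidate(ev.PromptID)
	})
}
```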

Provider Blueprints

Provider-agnostic intermediate representation between prompt content and LLM API calls:

| Adapter | Format |
| --- | --- |
| OpenAIAdapter | OpenAI chat completion format |
| AnthropicAdapter | Anthropic messages format (system extracted) |
| AzureAdapter | Azure OpenAI with deployment name mapping |

RenderToBlueprint maps PromptType to message roles (system → system message, user → user message, task → system + user pair, chain_of_thought → system with CoT + user).
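
The mapping can be pictured as below. The `Blueprint` and `Message` shapes and the function signature are assumptions, and the chain-of-thought instruction text is illustrative only; the role mapping itself follows the description above.

```go
package promptlibdoc

// Message and Blueprint are assumed shapes for the provider-agnostic
// intermediate representation.
type Message struct {
	Role    string // "system" or "user"
	Content string
}

type Blueprint struct {
	Messages []Message
}

// renderToBlueprint sketches the PromptType → message role mapping.
func renderToBlueprint(promptType, rendered, userInput string) Blueprint {
	switch promptType {
	case "system":
		return Blueprint{Messages: []Message{{Role: "system", Content: rendered}}}
	case "user":
		return Blueprint{Messages: []Message{{Role: "user", Content: rendered}}}
	case "task":
		return Blueprint{Messages: []Message{
			{Role: "system", Content: rendered},
			{Role: "user", Content: userInput},
		}}
	case "chain_of_thought":
		// The appended CoT instruction is illustrative, not the library's wording.
		return Blueprint{Messages: []Message{
			{Role: "system", Content: rendered + "\n\nReason step by step before giving the final answer."},
			{Role: "user", Content: userInput},
		}}
	default:
		return Blueprint{Messages: []Message{{Role: "user", Content: rendered}}}
	}
}
```

Adapters then translate the Blueprint into each provider's wire format; per the table above, AnthropicAdapter extracts the system message into Anthropic's dedicated system field.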

Prompt Analytics

Time-series queries over prompt usage and quality metrics:

  • GetTimeSeries: usage_count, success_rate, avg_rating, failure_count bucketed by hour/day/week
  • GetTopPrompts: prompts ranked by metric over time range
  • GetPromptComparison: side-by-side metrics for multiple prompts
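
A sketch of the analytics query surface; the method signatures, interval labels, and the `TimeSeriesPoint` shape are assumptions drawn from the bullets above.

```go
package promptlibdoc

import (
	"context"
	"time"
)

// TimeSeriesPoint is one bucketed value of a metric.
type TimeSeriesPoint struct {
	Bucket time.Time
	Value  float64
}

// Analytics is an assumed query interface over prompt usage and quality metrics.
type Analytics interface {
	// GetTimeSeries buckets a metric ("usage_count", "success_rate",
	// "avg_rating", "failure_count") by "hour", "day", or "week".
	GetTimeSeries(ctx context.Context, promptHash, metric, interval string, from, to time.Time) ([]TimeSeriesPoint, error)
	// GetTopPrompts ranks prompt hashes by a metric over the given range.
	GetTopPrompts(ctx context.Context, metric string, from, to time.Time, limit int) ([]string, error)
}
```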

API endpoints:

  • GET /api/prompts/{hash}/analytics?metric=usage_count&interval=day
  • GET /api/prompts/rankings?metric=success_rate&limit=10
  • GET /api/prompts/compare?hashes=abc,def,ghi

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| CRUVERO_PROMPTLIB_ENVS_ENABLED | true | Enable deployment environments |
| CRUVERO_PROMPTLIB_DEFAULT_ENVS | dev,staging,production | Environment names created per tenant |
| CRUVERO_PROMPTLIB_SNIPPETS_ENABLED | true | Enable snippet composition |
| CRUVERO_PROMPTLIB_SNIPPET_MAX_DEPTH | 3 | Max nested snippet depth |
| CRUVERO_PROMPTLIB_EXPERIMENTS_ENABLED | false | Enable A/B experimentation |
| CRUVERO_PROMPTLIB_EXPERIMENT_MAX_VARIANTS | 4 | Max variants per experiment |
| CRUVERO_PROMPTLIB_EVAL_ENABLED | true | Enable evaluation framework |
| CRUVERO_PROMPTLIB_EVAL_TIMEOUT | 300s | Per-entry eval timeout |
| CRUVERO_PROMPTLIB_EVAL_MAX_CONCURRENT | 10 | Max concurrent eval entries |
| CRUVERO_PROMPTLIB_DIFF_CONTEXT_LINES | 3 | Lines of context in diff output |
| CRUVERO_PROMPTLIB_NATS_CACHE_ENABLED | false | Enable NATS cache invalidation |
| CRUVERO_PROMPTLIB_NATS_SUBJECT | cruvero.prompts.events | NATS subject for prompt events |
| CRUVERO_PROMPTLIB_BLUEPRINT_ENABLED | false | Enable provider-agnostic blueprints |
| CRUVERO_PROMPTLIB_ANALYTICS_RETENTION | 90d | Analytics data retention period |

CLI Tools

| CLI | Description |
| --- | --- |
| prompt-eval | Run eval against dataset, exit 0/1 for CI/CD |
| prompt-dataset | Create/manage eval datasets (JSON, YAML, CSV, logs) |
| prompt-experiment | Create/list/complete A/B experiments |
| prompt-diff | Diff prompt versions with colored output |