Temporal Agent Runtime — Roadmap
A production-grade, Temporal-native agent runtime that treats durability, observability, and operational control as first-class concerns—not afterthoughts.
Vision
Most agent frameworks optimize for demo speed. We optimize for production survival.
LangGraph bolted durability onto a graph abstraction. We invert this: Temporal's battle-tested workflow engine is the foundation, and the agent abstraction compiles down to it. The result is an agent runtime where retry logic, failure recovery, human-in-the-loop, and multi-agent coordination aren't library features—they're infrastructure guarantees.
Differentiators (Why This Exists)
| Capability | LangGraph | This Runtime |
|---|---|---|
| Durability | Library-level checkpointers | Infrastructure-grade (Temporal cluster) |
| Failure recovery | Manual retry config | Automatic, policy-driven, battle-tested |
| Human-in-the-loop | Interrupt API | Native signals + queries with timeout semantics |
| Replay/debugging | Thread history | Full workflow replay with deterministic re-execution |
| Multi-agent | Graph composition | First-class supervisor workflows, sagas, compensation |
| Observability | Logging hooks | Native OpenTelemetry, distributed tracing across agents |
| Time-travel debugging | Checkpoint inspection | Replay any workflow from any point in history |
| Long-running agents | Memory management | Continue-as-new, external state, bounded histories |
Phase 1: Durable Single-Agent Loop (MVP)
Goal: A minimal agent that survives crashes, retries failures, and runs on Temporal.
Deliverables
1.1 Core Workflow: AgentRunWorkflow
```go
type AgentRunInput struct {
    RunID           string
    InitialPrompt   string
    ToolRegistryRef ToolRegistryRef // immutable version reference
    Config          AgentConfig
}

type AgentConfig struct {
    Model        string
    MaxSteps     int
    StepTimeout  time.Duration
    TotalTimeout time.Duration
}
```
- Deterministic state machine loop: decide → act → observe → repeat
- Clean stop conditions: max steps, goal achieved, timeout, explicit halt
- All non-determinism isolated to activities
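Stripped of the Temporal SDK, the loop above can be sketched in plain Go. This is a minimal illustration, not the runtime's actual implementation: `Decision`, `Decider`, and `Executor` are hypothetical names standing in for `LLMDecideActivity` and `ToolExecuteActivity`, and all non-determinism is isolated behind the two function values, mirroring the activity boundary.

```go
package main

import (
	"errors"
	"fmt"
)

// Decision is what the (mocked) LLM returns each cycle.
type Decision struct {
	Halt bool   // agent decided the goal is achieved
	Tool string // tool to invoke when not halting
	Args string
}

// Decider and Executor stand in for LLMDecideActivity / ToolExecuteActivity.
type Decider func(state []string) (Decision, error)
type Executor func(tool, args string) (string, error)

// RunLoop is the deterministic decide → act → observe loop with a
// max-steps stop condition.
func RunLoop(maxSteps int, decide Decider, exec Executor) ([]string, error) {
	var state []string
	for step := 0; step < maxSteps; step++ {
		d, err := decide(state) // decide
		if err != nil {
			return state, err
		}
		if d.Halt {
			return state, nil // clean stop: goal achieved
		}
		obs, err := exec(d.Tool, d.Args) // act
		if err != nil {
			return state, err
		}
		state = append(state, obs) // observe
	}
	return state, errors.New("max steps exceeded")
}

func main() {
	// A decider that calls a tool twice, then halts.
	n := 0
	decide := func(state []string) (Decision, error) {
		if n >= 2 {
			return Decision{Halt: true}, nil
		}
		n++
		return Decision{Tool: "echo", Args: fmt.Sprintf("step-%d", n)}, nil
	}
	exec := func(tool, args string) (string, error) { return args, nil }
	state, err := RunLoop(10, decide, exec)
	fmt.Println(state, err) // [step-1 step-2] <nil>
}
```

In the real workflow the loop body stays deterministic and the two function values become Temporal activity invocations.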
1.2 Activities
| Activity | Purpose |
|---|---|
| `LLMDecideActivity` | Call LLM with current state, return structured decision |
| `ToolExecuteActivity` | Execute a tool with validated arguments |
| `ObserveActivity` | Process tool results, update observations |
1.3 Tool Registry (Immutable + Versioned from Day 1)
```go
type ToolRegistry struct {
    ID        string
    Version   string // semver, immutable once published
    Tools     []ToolDefinition
    CreatedAt time.Time
    Hash      string // content-addressable
}

type ToolDefinition struct {
    Name        string
    Description string
    Schema      JSONSchema   // strict validation
    Executor    ToolExecutor // activity implementation
    RetryPolicy *RetryPolicy // per-tool failure handling
}
```
Why versioned from Phase 1: Debugging agent behavior requires knowing exactly what tools it had. Mutable registries destroy forensics.
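One way the content-addressable `Hash` field might be derived: hash a canonical serialization of the tool list, so two registries with identical tools always share a hash. A minimal sketch, assuming a trimmed `ToolDefinition` with only behavior-affecting fields:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// ToolDefinition is trimmed here to the fields that affect agent behavior.
type ToolDefinition struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	Schema      string `json:"schema"`
}

// RegistryHash derives a content-addressable hash: sha256 over a canonical
// JSON encoding of the tool list. encoding/json serializes struct fields in
// declaration order, so identical tool sets produce identical hashes.
func RegistryHash(tools []ToolDefinition) (string, error) {
	b, err := json.Marshal(tools)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	tools := []ToolDefinition{
		{Name: "http_get", Description: "Fetch a URL", Schema: `{"type":"object"}`},
	}
	h, _ := RegistryHash(tools)
	fmt.Println(h[:12]) // stable prefix for the same tool set
}
```

The same technique applies to the `PromptHash`/`StateHash`/`ToolSchemaHash` fields in Phase 2's decision log.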
1.4 Observability (Production-Shaped from Start)
- Workflow-level: run ID, steps completed, current state hash
- Activity-level: latency, token usage, retry count
- OpenTelemetry spans: one span per decision cycle, child spans for tool calls
- Structured logs with correlation IDs
1.5 Example Project
- 3 tools: `http_get`, `calculator`, `key_value_store`
- One agent: "Research assistant that can fetch URLs, compute, and remember facts"
- Demonstrates: retry on HTTP failure, timeout handling, clean shutdown
Exit Criteria
- Agent survives worker restart mid-run
- Failed tool calls retry with backoff
- Full trace visible in Jaeger/Tempo
- Tool schema validation rejects malformed calls
Phase 2: Checkpoints, Replay, and Human-in-the-Loop
Goal: Match LangGraph's operational UX (pause/resume, debugging, interrupts) using Temporal primitives.
Deliverables
2.1 Interrupt System (Signals + Queries)
```go
// Workflow can pause and wait for human input
func (w *AgentWorkflow) WaitForApproval(ctx workflow.Context, request ApprovalRequest) (ApprovalResponse, error)

// External systems interact via signals
SignalApprove(workflowID, runID, ApprovalResponse)
SignalReject(workflowID, runID, reason)
SignalEditState(workflowID, runID, StatePatch)

// Queries for UI
QueryCurrentState(workflowID) -> AgentState
QueryPendingApprovals(workflowID) -> []ApprovalRequest
QueryDecisionLog(workflowID) -> []DecisionRecord
```
Interrupt patterns:
- Pre-tool approval: "Agent wants to call `send_email`. Approve?"
- Mid-run checkpoint: "Agent has drafted a plan. Review before execution?"
- Timeout escalation: "Agent stuck for 5 minutes. Intervene?"
2.2 Decision Log (First-Class Forensics)
```go
type DecisionRecord struct {
    StepIndex int
    Timestamp time.Time

    // Inputs (deterministic hash)
    PromptHash     string
    StateHash      string
    ToolSchemaHash string
    ModelConfig    ModelConfig

    // Outputs
    Decision   Decision
    TokensUsed int
    Latency    time.Duration

    // Provenance
    ModelID      string
    ModelVersion string
}
```
Enables:
- "Why did the agent do X?" → Find the decision, see exact inputs
- "Would it do the same thing today?" → Replay with same hashes
- Regression testing: golden decision logs as test fixtures
2.3 Replay and Time-Travel Debugging
- Leverage Temporal's native workflow replay
- Add: "replay from step N with modified state"
- Add: "re-run decision at step N with current model" (A/B testing decisions)
2.4 Continue-As-New for Longevity
- Trigger: state size > threshold OR step count > threshold
- Seamless: external callers don't see the boundary
- State compaction: summarize conversation history before rollover
2.5 Failure Semantics
```go
type ToolRetryPolicy struct {
    MaxAttempts        int
    InitialBackoff     time.Duration
    MaxBackoff         time.Duration
    BackoffCoefficient float64
    NonRetryableErrors []string // e.g., "validation_error"
}

type CircuitBreakerConfig struct {
    FailureThreshold int
    RecoveryTimeout  time.Duration
}
```
- Per-tool retry policies (HTTP tools retry, validation errors don't)
- Circuit breakers for external APIs
- Graceful degradation: "tool unavailable, skip or substitute"
2.6 MCP Tool Bridge (External Tool Servers)
Goal: Allow MCP servers to be registered and invoked as tools within the runtime.
- Adapter to map MCP tool metadata to `ToolDefinition` (schema + description)
- MCP tool invocation path through `ToolExecuteActivity`
- Versioned registry entries for MCP tool sets
- Configurable MCP server endpoints
- Test with local MCP servers (e.g., `../mcp-notion`, `../mcp-todoist`)
Exit Criteria
- Agent pauses, waits for human signal, resumes correctly
- Decision log captures all inputs with hashes
- Can replay workflow from Temporal UI and get same result
- Continue-as-new triggers cleanly on long runs
- MCP tools can be registered and invoked as standard tools
Phase 3: Structured Execution Graphs
Goal: Support branching, parallelism, and conditional routing without exposing Temporal's selector complexity.
Deliverables
3.1 Step DSL
```go
agent := NewAgent("research-agent").
    Step("plan", PlanStep).
    Step("research", ResearchStep,
        When(func(s State) bool { return s.NeedsResearch })).
    Parallel("gather",
        Branch("web", WebSearchStep),
        Branch("db", DatabaseStep),
    ).Join(AggregateResults).
    Step("synthesize", SynthesizeStep).
    Step("review", ReviewStep,
        Interrupt(RequireApproval("final_review"))).
    Build()
```
Compiles to: Temporal workflow with proper selector logic, join points, and signal waits.
3.2 Conditional Transitions
```go
Step("classify", ClassifyStep).
Route(
    On("simple", DirectAnswerStep),
    On("complex", DeepResearchStep),
    On("dangerous", EscalateToHumanStep),
    Default(FallbackStep),
)
```
3.3 Parallel Execution with Join Semantics
| Join Type | Behavior |
|---|---|
| `JoinAll` | Wait for all branches, aggregate results |
| `JoinAny` | Return first successful result, cancel others |
| `JoinN(n)` | Wait for N successes, cancel rest |
| `JoinVote` | Wait for all, return majority decision |
3.4 Subgraph Composition
```go
// Define reusable subgraphs
ResearchSubgraph := Subgraph("research").
    Step("search", SearchStep).
    Step("filter", FilterStep).
    Step("summarize", SummarizeStep).
    Build()

// Compose into larger workflows
MainAgent := NewAgent("main").
    Step("plan", PlanStep).
    Include(ResearchSubgraph). // inline expansion
    Step("respond", RespondStep).
    Build()
```
Exit Criteria
- DSL compiles to valid Temporal workflows
- Parallel branches execute concurrently
- Join semantics work correctly (tested: all, any, vote)
- Subgraphs compose without namespace collisions
Phase 4: Memory Architecture
Goal: Structured memory that survives workflow boundaries and scales.
Deliverables
4.1 Memory Taxonomy
| Memory Type | Scope | Storage | Example |
|---|---|---|---|
| Working | Current step | Workflow state | "User just said X" |
| Episodic | Current run | Workflow state + external | "Earlier in this conversation..." |
| Semantic | Cross-run | External store | "User prefers formal tone" |
| Procedural | Global | External store | "How to search the web" |
4.2 Memory Store Interface
```go
type MemoryStore interface {
    // Episodic
    SaveEpisode(ctx context.Context, runID string, episode Episode) error
    GetEpisodes(ctx context.Context, runID string, filter EpisodeFilter) ([]Episode, error)

    // Semantic (vector-backed)
    SaveFact(ctx context.Context, namespace string, fact Fact, embedding []float32) error
    QueryFacts(ctx context.Context, namespace string, query []float32, k int) ([]Fact, error)

    // Procedural
    SaveProcedure(ctx context.Context, name string, procedure Procedure) error
    GetProcedure(ctx context.Context, name string) (Procedure, error)
}
```
Implementations:
- `PostgresMemoryStore` (recommended for most cases)
- `RedisMemoryStore` (high-throughput, TTL-based episodic)
- `PineconeMemoryStore` (semantic search at scale)
- `InMemoryStore` (testing)
4.3 Automatic Memory Management
```go
type MemoryPolicy struct {
    WorkingMemoryLimit int  // max items in working memory
    EpisodicSummarize  int  // summarize after N episodes
    SemanticExtract    bool // auto-extract facts to semantic store
    ConversationWindow int  // sliding window for context
}
```
- Automatic summarization when episodic memory grows
- Fact extraction: identify and persist durable knowledge
- Forgetting: TTL-based expiration, relevance-based pruning
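The `WorkingMemoryLimit` + summarization interaction can be sketched as a bounded sliding window that compacts evicted items into a summary instead of silently dropping them. `WorkingMemory` and its injected `Summarize` function (an LLM call in practice) are illustrative names for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkingMemory enforces a bounded sliding window; items pushed out of the
// window are folded into a running summary rather than discarded.
type WorkingMemory struct {
	Limit     int
	Summary   string
	Items     []string
	Summarize func(summary string, evicted []string) string
}

func (m *WorkingMemory) Add(item string) {
	m.Items = append(m.Items, item)
	if len(m.Items) > m.Limit {
		evicted := m.Items[:len(m.Items)-m.Limit]
		m.Items = append([]string(nil), m.Items[len(m.Items)-m.Limit:]...)
		m.Summary = m.Summarize(m.Summary, evicted)
	}
}

func main() {
	m := &WorkingMemory{
		Limit: 2,
		// Stand-in for an LLM summarization call.
		Summarize: func(summary string, evicted []string) string {
			return strings.TrimSpace(summary + " " + strings.Join(evicted, " "))
		},
	}
	for _, item := range []string{"a", "b", "c", "d"} {
		m.Add(item)
	}
	fmt.Println(m.Summary, m.Items) // a b [c d]
}
```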
4.4 Cross-Run Persistence
```go
// Start new run with memory from previous runs
input := AgentRunInput{
    RunID: "run-456",
    MemoryRefs: []MemoryRef{
        {Type: "episodic", RunID: "run-123"}, // continue conversation
        {Type: "semantic", Namespace: "user-prefs"},
    },
}
```
Exit Criteria
- Working memory bounded and performant
- Episodic memory persists across continue-as-new
- Semantic search returns relevant facts
- Memory policies auto-manage growth
Phase 5: Multi-Agent Orchestration
Goal: First-class support for agent coordination patterns.
Deliverables
5.1 Supervisor Pattern
```go
supervisor := NewSupervisor("project-manager").
    Agent("researcher", ResearcherAgent).
    Agent("writer", WriterAgent).
    Agent("critic", CriticAgent).
    Coordinate(func(ctx SupervisorContext) Output {
        research := ctx.Delegate("researcher", task)
        draft := ctx.Delegate("writer", research.Output)
        review := ctx.Delegate("critic", draft.Output)
        if review.Approved {
            return draft.Output
        }
        // iterate...
    }).
    Build()
```
5.2 Coordination Patterns
| Pattern | Description | Use Case |
|---|---|---|
| Delegate | Supervisor assigns task, waits for result | Task decomposition |
| Broadcast | Send to all agents, collect responses | Brainstorming |
| Debate | Agents argue positions, supervisor decides | Decision-making |
| Pipeline | Chain agents sequentially | Document processing |
| MapReduce | Parallel processing, aggregated results | Large-scale analysis |
| Voting | Agents vote, majority wins | Consensus |
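The Voting pattern's commit step is simple enough to show directly: count each agent's answer and commit the plurality winner. A minimal sketch (`MajorityVote` is an illustrative helper; ties go to whichever answer reaches the winning count first):

```go
package main

import "fmt"

// MajorityVote tallies agent answers and returns the most common one along
// with its vote count.
func MajorityVote(answers []string) (string, int) {
	counts := map[string]int{}
	best, bestN := "", 0
	for _, a := range answers {
		counts[a]++
		if counts[a] > bestN {
			best, bestN = a, counts[a]
		}
	}
	return best, bestN
}

func main() {
	winner, votes := MajorityVote([]string{"approve", "reject", "approve"})
	fmt.Println(winner, votes) // approve 2
}
```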
5.3 Inter-Agent Communication
```go
// Direct messaging (via Temporal signals)
ctx.Send("critic", Message{Type: "feedback", Content: feedback})

// Shared blackboard (external state)
ctx.Blackboard.Write("current_draft", draft)
draft := ctx.Blackboard.Read("current_draft")

// Event bus (pub/sub via Temporal)
ctx.Publish("draft_updated", DraftEvent{...})
ctx.Subscribe("draft_updated", handler)
```
5.4 Saga Pattern for Multi-Agent Transactions
```go
saga := NewSaga("order-processing").
    Step("validate", ValidateAgent, CompensateValidation).
    Step("reserve", InventoryAgent, CompensateReservation).
    Step("charge", PaymentAgent, CompensateCharge).
    Step("fulfill", FulfillmentAgent, CompensateFulfillment).
    Build()
```
- Automatic compensation on failure
- Distributed transaction semantics across agents
- Partial completion visibility
Exit Criteria
- Supervisor can coordinate 3+ agents
- Debate pattern produces reasoned output
- Saga compensates correctly on mid-process failure
- Blackboard state consistent across agents
Phase 6: Differentiating Features
Goal: Capabilities that don't exist in LangGraph or any current framework.
6.1 Causal Tracing
What: Track exactly which inputs influenced which outputs across the entire agent execution.
```go
type CausalTrace struct {
    OutputID   string
    Influences []Influence
}

type Influence struct {
    InputID    string   // which input
    InputType  string   // "user_message", "tool_result", "memory_fact"
    Pathway    []string // steps it passed through
    Confidence float64  // attribution strength
}

// Query: "Why did the agent recommend product X?"
trace := GetCausalTrace(runID, outputID)
// Returns: user mentioned "budget", tool returned price data, memory had preference
```
Enables:
- Explainability: "The agent recommended X because..."
- Debugging: "This output was wrong because this tool returned bad data"
- Compliance: Audit trail for regulated industries
6.2 Counterfactual Replay
What: "What would have happened if...?" simulation.
```go
// Original run made decision A at step 3
// What if it had made decision B?
counterfactual := ReplayWithOverride(runID, Override{
    StepIndex: 3,
    Decision:  DecisionB,
})

// Returns: full execution trace of alternate timeline
```
Enables:
- Decision analysis: compare outcomes of different choices
- Training data: generate (decision, outcome) pairs
- Root cause: "If the agent had done X instead, would it have succeeded?"
6.3 Adaptive Retry with Learning
What: Retry policies that learn from failure patterns.
```go
type AdaptiveRetryPolicy struct {
    BasePolicy      RetryPolicy
    LearningEnabled bool

    // Learned adjustments
    SuccessPatterns []Pattern // conditions that predict success
    FailurePatterns []Pattern // conditions that predict failure
}

// System learns: "HTTP calls to api.example.com fail 80% between 2-3pm UTC"
// Automatically: delays retries, switches to backup, or skips
```
6.4 Cost-Aware Execution
What: Track and optimize for cost, not just correctness.
```go
type CostPolicy struct {
    MaxCostPerRun  float64
    MaxCostPerStep float64
    PreferCheaper  bool // choose cheaper model when confidence high
    CostBreakdown  bool // detailed cost attribution
}

type StepCost struct {
    StepIndex int
    TokenCost float64
    ToolCost  float64 // API calls, compute, etc.
    TimeCost  time.Duration
    TotalCost float64
}

// Query: "Which step is most expensive?"
// Action: Auto-switch to cheaper model for low-stakes decisions
```
6.5 Semantic Versioning for Agent Behavior
What: Version agents like software, with semantic meaning.
```go
type AgentVersion struct {
    Version   string // semver
    Changelog string

    // Breaking change detection
    SchemaChanges   []SchemaChange   // tool interface changes
    BehaviorChanges []BehaviorChange // detected via test suite

    // Compatibility
    Compatible    []string          // versions this can replace
    MigrationPath func(State) State // state migration if needed
}

// Deploy with confidence: "v2.1.0 is backward-compatible with v2.0.x"
// Rollback safely: "v2.2.0 broke X, rolling back to v2.1.0"
```
6.6 Differential Testing
What: Automatically compare agent versions.
```go
// Run same inputs through two agent versions
diff := DifferentialTest(
    AgentV1, AgentV2,
    TestSuite{
        Inputs:  []Input{...},
        Metrics: []Metric{Correctness, Cost, Latency},
    },
)

// Output: "V2 is 15% more accurate, 20% cheaper, but 10% slower"
// Breakdown by input category, failure modes, edge cases
```
6.7 Live Agent Inspection
What: Attach to running agents and inspect/modify state.
```
// From CLI or UI
inspect --workflow-id agent-run-123
> state                       # print current state
> state.memory                # drill into memory
> set state.goal "new goal"   # modify state (triggers re-evaluation)
> step                        # execute one step, pause
> continue                    # resume normal execution
> inject tool_result {...}    # inject synthetic tool result
```
Enables:
- Debugging stuck agents in production
- Manual intervention without restart
- Training/demo scenarios with controlled inputs
6.8 Speculative Execution
What: Run multiple decision paths in parallel, commit the best one.
```go
type SpeculativeConfig struct {
    Enabled     bool
    MaxBranches int           // how many paths to explore
    SelectionFn SelectionFunc // how to pick winner
    CostLimit   float64       // max extra cost for speculation
}

// Agent uncertain between 3 approaches
// System runs all 3 in parallel (as child workflows)
// Commits the one that succeeds first / scores highest / costs least
```
Enables:
- Faster resolution of uncertain decisions
- Automatic A/B testing of strategies
- Graceful handling of "I'm not sure, let me try both"
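The run-in-parallel-then-commit mechanic can be sketched without Temporal: explore every path concurrently, refuse to commit if the speculation budget was blown, otherwise take the best score. `Candidate` and `Speculate` are hypothetical names for this sketch; in the runtime the paths would be child workflows.

```go
package main

import (
	"fmt"
	"sync"
)

// Candidate is one speculative path's outcome.
type Candidate struct {
	Name  string
	Score float64
	Cost  float64
}

// Speculate runs every path in parallel. If the combined extra cost stays
// within the budget, it commits the highest-scoring candidate; otherwise
// it refuses to commit.
func Speculate(costLimit float64, paths ...func() Candidate) (Candidate, bool) {
	results := make([]Candidate, len(paths))
	var wg sync.WaitGroup
	for i, run := range paths {
		wg.Add(1)
		go func(i int, run func() Candidate) {
			defer wg.Done()
			results[i] = run()
		}(i, run)
	}
	wg.Wait()

	total := 0.0
	for _, c := range results {
		total += c.Cost
	}
	if total > costLimit {
		return Candidate{}, false // speculation too expensive to commit
	}
	best, found := Candidate{}, false
	for _, c := range results {
		if !found || c.Score > best.Score {
			best, found = c, true
		}
	}
	return best, found
}

func main() {
	best, ok := Speculate(1.0,
		func() Candidate { return Candidate{Name: "direct-answer", Score: 0.6, Cost: 0.1} },
		func() Candidate { return Candidate{Name: "deep-research", Score: 0.9, Cost: 0.4} },
	)
	fmt.Println(best.Name, ok) // deep-research true
}
```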
Phase 7A: Developer Experience — CLI + Scaffolding
Goal: Fast path to a working agent project and local iteration.
Deliverables
- CLI project scaffolding:
  - `temporal-agent init my-agent`
  - `temporal-agent dev` (local run + hot reload)
  - `temporal-agent deploy --cluster prod`
- Environment bootstrap (compose, env templates, registry seed)
- Local workflow inspection shortcuts (list runs, latest run, cost summary)
Exit Criteria
- New project created in < 2 minutes
- `dev` runs local worker + sample run end-to-end
Phase 7B: Developer Experience — Testing & Replay Harness
Goal: Deterministic tests and safe replay tooling.
Deliverables
- Deterministic LLM mocks + tool mocks
- Golden input tests for decision logs
- Replay helpers (from step N, override decision/state)
```go
func TestResearchAgent(t *testing.T) {
    suite := agenttest.NewSuite(t)

    // Deterministic LLM mock
    suite.MockLLM(agenttest.Deterministic(map[string]string{
        "hash-of-prompt-1": "expected-response-1",
    }))

    // Tool mocks
    suite.MockTool("http_get", func(args Args) Result {
        return Result{Body: "mock response"}
    })

    // Run and assert
    result := suite.Run(ResearchAgent, Input{Prompt: "test"})
    assert.Equal(t, "expected output", result.Output)
    assert.Equal(t, 3, result.StepCount)
    assert.NoError(t, result.Error)
}
```
Exit Criteria
- Tests run deterministically with mocks
- Replay from step N reproducible in CI
Phase 7C: Developer Experience — Web UI + VS Code Extension
Goal: First-class operator UI and editor integration.
Deliverables
Web UI
- Run list view (status, duration, cost)
- Run detail view (timeline, state diffs, tool calls)
- Interrupt queue (approvals/rejections)
- Replay console (compare original vs replay)
VS Code Extension
- DSL syntax highlighting
- Inline tool schema validation
- Run agent + output panel
- Breakpoints and step debugging
- Temporal workflow visualization
Exit Criteria
- UI shows all run details with drill-down
- Extension provides meaningful DX improvement
Phase 7D: Provider Parity — Azure OpenAI
Goal: Azure OpenAI support with the same model catalog + cost pipeline as OpenRouter.
Deliverables
- Azure OpenAI client (chat + tools) with config-based provider selection
- Model catalog ingestion (list models + context limits + pricing)
- Cost accounting with Azure usage metrics
- Per-run/provider selection (OpenRouter or Azure) and fallback rules
- Tooling parity: `models-refresh`, `models-list`, cost query
Exit Criteria
- Azure and OpenRouter interchangeable via config
- Cost tracking matches provider usage
Phase 7E: Sandboxed Python Tool
Goal: Provide a safe, sandboxed Python execution tool that agents can call as part of workflows.
Deliverables
- `python_exec` tool with strict resource limits and no network access
- Output capture (stdout/stderr/exit code)
- Tool schema with timeouts and input payloads
- Tests covering timeouts and sandbox constraints
Exit Criteria
- Tool executes simple scripts safely
- Timeouts enforced
- Network and filesystem write access blocked
Phase 8: Integrations & Ecosystem
Goal: Batteries included for common use cases.
8.1 LLM Providers
| Provider | Status | Features |
|---|---|---|
| OpenAI | Priority | GPT-4, function calling, vision |
| Anthropic | Priority | Claude, tool use |
| Google | Phase 2 | Gemini |
| AWS Bedrock | Phase 2 | Multi-model |
| Local (Ollama) | Phase 2 | Self-hosted |
| Azure OpenAI | Phase 7D | Enterprise |
8.2 Tool Kits
| Kit | Tools |
|---|---|
| Web | HTTP GET/POST, scrape, screenshot |
| Search | Google, Bing, DuckDuckGo, Tavily |
| Database | SQL query, vector search |
| Files | Read, write, parse (PDF, DOCX, etc.) |
| Code | Execute Python/JS, sandbox |
| Communication | Email, Slack, SMS |
| Cloud | AWS, GCP, Azure SDKs |
8.3 Vector Stores
- Pinecone
- Weaviate
- Qdrant
- pgvector
- Chroma (local)
8.4 Observability
- OpenTelemetry (native)
- Datadog
- Honeycomb
- Grafana stack
8.5 Auth & Secrets
- Vault integration
- AWS Secrets Manager
- Environment injection
- OAuth flows for tools
Phase 9: Enterprise Hardening
Goal: Production-ready for serious workloads.
Deliverables
9.1 Multi-Tenancy
```go
type TenantConfig struct {
    TenantID       string
    Namespace      string // Temporal namespace isolation
    ResourceQuotas ResourceQuotas
    RateLimits     RateLimits
    AllowedModels  []string
    AllowedTools   []string
}
```
9.2 Rate Limiting & Quotas
- Per-tenant rate limits
- Per-model token quotas
- Per-tool call limits
- Graceful degradation on quota exhaustion
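Per-tenant rate limiting is typically a token bucket, and the "without race conditions" exit criterion below means the limiter must be safe under concurrent workers. A minimal stdlib sketch (`TokenBucket` is an illustrative name; a production deployment would likely use a shared store or `golang.org/x/time/rate`):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a minimal per-tenant rate limiter: capacity tokens,
// refilled at rate tokens/second. Allow is safe for concurrent use.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64
	tokens   float64
	rate     float64 // tokens per second
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, rate: rate, last: time.Now()}
}

// Allow consumes one token if available, refilling based on elapsed time.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := NewTokenBucket(2, 0) // 2 calls allowed, no refill
	fmt.Println(bucket.Allow(), bucket.Allow(), bucket.Allow()) // true true false
}
```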
9.3 Audit Logging
```go
type AuditEvent struct {
    Timestamp time.Time
    TenantID  string
    RunID     string
    EventType string // "decision", "tool_call", "state_change", "approval"
    Actor     string // "agent", "user:123", "system"
    Details   any
    Hash      string // tamper-evident
}
```
- Immutable audit log
- Compliance exports (SOC2, HIPAA formats)
- PII detection and redaction
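One common way to make the `Hash` field tamper-evident is a hash chain: each entry's hash covers its payload plus the previous hash, so rewriting any past entry invalidates every later one. A sketch under that assumption (`AuditEntry`, `Append`, `Verify` are illustrative names):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AuditEntry is a hash-chained log record.
type AuditEntry struct {
	Payload string
	Hash    string
}

// chainHash covers the previous hash and this entry's payload.
func chainHash(prev, payload string) string {
	sum := sha256.Sum256([]byte(prev + "|" + payload))
	return hex.EncodeToString(sum[:])
}

// Append adds an event to the chain.
func Append(entries []AuditEntry, payload string) []AuditEntry {
	prev := ""
	if len(entries) > 0 {
		prev = entries[len(entries)-1].Hash
	}
	return append(entries, AuditEntry{Payload: payload, Hash: chainHash(prev, payload)})
}

// Verify recomputes the chain and returns the index of the first tampered
// entry, or -1 if the chain is intact.
func Verify(entries []AuditEntry) int {
	prev := ""
	for i, e := range entries {
		if chainHash(prev, e.Payload) != e.Hash {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	entries := Append(Append(nil, `{"event":"decision"}`), `{"event":"tool_call"}`)
	fmt.Println(Verify(entries)) // -1: chain intact
	entries[0].Payload = "tampered"
	fmt.Println(Verify(entries)) // 0: first bad entry
}
```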
9.4 Security
- Tool sandboxing (gVisor, Firecracker)
- Input sanitization
- Output filtering
- Network policies for tools
9.5 High Availability
- Multi-region Temporal deployment guide
- State replication strategies
- Disaster recovery playbook
Exit Criteria
- Tenants isolated at namespace level
- Rate limits enforced without race conditions
- Audit log tamper-evident
- Security review passed
Timeline (Estimated)
| Phase | Duration | Cumulative |
|---|---|---|
| Phase 1: MVP | 4-6 weeks | 6 weeks |
| Phase 2: Checkpoints/HITL | 4-6 weeks | 12 weeks |
| Phase 3: Graphs | 3-4 weeks | 16 weeks |
| Phase 4: Memory | 4-6 weeks | 22 weeks |
| Phase 5: Multi-Agent | 4-6 weeks | 28 weeks |
| Phase 6: Differentiators | 6-8 weeks | 36 weeks |
| Phase 7A: DevEx CLI | 2-3 weeks | 38-39 weeks |
| Phase 7B: DevEx Testing | 2-3 weeks | 40-42 weeks |
| Phase 7C: DevEx UI/Extension | 4-6 weeks | 44-48 weeks |
| Phase 7D: Azure Provider Parity | 2-3 weeks | 46-51 weeks |
| Phase 8: Integrations | Ongoing | - |
| Phase 9: Enterprise | 6-8 weeks | 52-59 weeks |
Note: Phases 7-8 can run in parallel with 5-6. Enterprise hardening (Phase 9) can begin after Phase 2 if there's early enterprise interest.
Success Metrics
Adoption
- GitHub stars (a vanity metric, but it drives visibility)
- Production deployments (the metric that matters)
- Community contributions
Technical
- Agent survival rate (% completing without crash)
- Mean time to recovery (on failure)
- P99 step latency
Developer Experience
- Time to first working agent
- Test coverage achievable
- Debug time for failures
Differentiation
- Features not available elsewhere
- Performance vs alternatives
- Operational clarity vs alternatives
Open Questions
- DSL syntax: Go-native (method chaining) vs external config (YAML/JSON)?
- State serialization: Protobuf vs JSON vs custom?
- Memory store default: Postgres (safe) vs Redis (fast)?
- UI: Build custom vs extend Temporal UI?
- Licensing: Apache 2.0 vs MIT vs BSL?
Appendix: Why Not Just Use LangGraph?
LangGraph is excellent for prototyping. It falls short for production:
| Concern | LangGraph | This Runtime |
|---|---|---|
| Durability | Library checkpointers (Postgres, Redis) | Temporal cluster (distributed, battle-tested) |
| Failure handling | Manual retry logic | Infrastructure-level retries, timeouts, circuit breakers |
| Multi-agent | Graph composition | First-class supervision, sagas, compensation |
| Debugging | Thread history | Full replay, counterfactual analysis, causal tracing |
| Operational control | Basic | Signals, queries, live inspection, hot fixes |
| Scale | Single process | Distributed workers, horizontal scale |
| Long-running | Memory issues | Continue-as-new, bounded histories |
We're not building "LangGraph but different." We're building a production-grade agent runtime that happens to match (and exceed) LangGraph's capabilities.
Last updated: Phase 0 (Planning)