Temporal Agent Runtime — Roadmap

A production-grade, Temporal-native agent runtime that treats durability, observability, and operational control as first-class concerns—not afterthoughts.


Vision

Most agent frameworks optimize for demo speed. We optimize for production survival.

LangGraph bolted durability onto a graph abstraction. We invert this: Temporal's battle-tested workflow engine is the foundation, and the agent abstraction compiles down to it. The result is an agent runtime where retry logic, failure recovery, human-in-the-loop, and multi-agent coordination aren't library features—they're infrastructure guarantees.


Differentiators (Why This Exists)

| Capability | LangGraph | This Runtime |
| --- | --- | --- |
| Durability | Library-level checkpointers | Infrastructure-grade (Temporal cluster) |
| Failure recovery | Manual retry config | Automatic, policy-driven, battle-tested |
| Human-in-the-loop | Interrupt API | Native signals + queries with timeout semantics |
| Replay/debugging | Thread history | Full workflow replay with deterministic re-execution |
| Multi-agent | Graph composition | First-class supervisor workflows, sagas, compensation |
| Observability | Logging hooks | Native OpenTelemetry, distributed tracing across agents |
| Time-travel debugging | Checkpoint inspection | Replay any workflow from any point in history |
| Long-running agents | Memory management | Continue-as-new, external state, bounded histories |

Phase 1: Durable Single-Agent Loop (MVP)

Goal: A minimal agent that survives crashes, retries failures, and runs on Temporal.

Deliverables

1.1 Core Workflow: AgentRunWorkflow

type AgentRunInput struct {
    RunID           string
    InitialPrompt   string
    ToolRegistryRef ToolRegistryRef // immutable version reference
    Config          AgentConfig
}

type AgentConfig struct {
    Model        string
    MaxSteps     int
    StepTimeout  time.Duration
    TotalTimeout time.Duration
}

  • Deterministic state machine loop: decide → act → observe → repeat
  • Clean stop conditions: max steps, goal achieved, timeout, explicit halt
  • All non-determinism isolated to activities
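
A sketch of how this loop could compile down to Temporal's Go SDK. The state helpers, Decision fields, and error sentinel are illustrative assumptions, not a committed API:

import "go.temporal.io/sdk/workflow"

func AgentRunWorkflow(ctx workflow.Context, input AgentRunInput) (string, error) {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: input.Config.StepTimeout,
    })

    state := NewAgentState(input) // hypothetical: seed state from the initial prompt
    for step := 0; step < input.Config.MaxSteps; step++ {
        // decide: the only LLM call, isolated in an activity
        var decision Decision
        if err := workflow.ExecuteActivity(ctx, LLMDecideActivity, state).Get(ctx, &decision); err != nil {
            return "", err
        }
        if decision.Halt { // goal achieved or explicit stop
            return decision.FinalAnswer, nil
        }

        // act: run the chosen tool as an activity
        var result ToolResult
        if err := workflow.ExecuteActivity(ctx, ToolExecuteActivity, decision.ToolCall).Get(ctx, &result); err != nil {
            return "", err
        }

        // observe: fold the result back into deterministic workflow state
        if err := workflow.ExecuteActivity(ctx, ObserveActivity, state, result).Get(ctx, &state); err != nil {
            return "", err
        }
    }
    return "", ErrMaxSteps // hypothetical sentinel for the max-steps stop condition
}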

1.2 Activities

| Activity | Purpose |
| --- | --- |
| LLMDecideActivity | Call LLM with current state, return structured decision |
| ToolExecuteActivity | Execute a tool with validated arguments |
| ObserveActivity | Process tool results, update observations |

1.3 Tool Registry (Immutable + Versioned from Day 1)

type ToolRegistry struct {
    ID        string
    Version   string // semver, immutable once published
    Tools     []ToolDefinition
    CreatedAt time.Time
    Hash      string // content-addressable
}

type ToolDefinition struct {
    Name        string
    Description string
    Schema      JSONSchema   // strict validation
    Executor    ToolExecutor // activity implementation
    RetryPolicy *RetryPolicy // per-tool failure handling
}

Why versioned from Phase 1: Debugging agent behavior requires knowing exactly what tools it had. Mutable registries destroy forensics.
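
One way the content-addressable hash could be derived, sketched as SHA-256 over the JSON encoding of the tool list; the exact canonicalization scheme is an open implementation detail:

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
)

// registryHash derives a content-addressable ID from the tool definitions.
// Assumes ToolDefinition serializes deterministically; a canonical JSON
// encoder (sorted keys) would be needed in practice.
func registryHash(tools []ToolDefinition) (string, error) {
    b, err := json.Marshal(tools)
    if err != nil {
        return "", err
    }
    sum := sha256.Sum256(b)
    return hex.EncodeToString(sum[:]), nil
}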

1.4 Observability (Production-Shaped from Start)

  • Workflow-level: run ID, steps completed, current state hash
  • Activity-level: latency, token usage, retry count
  • OpenTelemetry spans: one span per decision cycle, child spans for tool calls
  • Structured logs with correlation IDs
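
A sketch of the per-cycle span, recorded from activity code since workflow code must stay deterministic (tracing would live in activities or interceptors); the span and attribute names are illustrative:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// withDecisionSpan opens one span per decision cycle; tool-call spans
// started from the returned context become its children.
func withDecisionSpan(ctx context.Context, runID string, step int) (context.Context, trace.Span) {
    ctx, span := otel.Tracer("agent-runtime").Start(ctx, "decision_cycle")
    span.SetAttributes(
        attribute.String("agent.run_id", runID),
        attribute.Int("agent.step", step),
    )
    return ctx, span
}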

1.5 Example Project

  • 3 tools: http_get, calculator, key_value_store
  • One agent: "Research assistant that can fetch URLs, compute, and remember facts"
  • Demonstrates: retry on HTTP failure, timeout handling, clean shutdown

Exit Criteria

  • Agent survives worker restart mid-run
  • Failed tool calls retry with backoff
  • Full trace visible in Jaeger/Tempo
  • Tool schema validation rejects malformed calls

Phase 2: Checkpoints, Replay, and Human-in-the-Loop

Goal: Match LangGraph's operational UX (pause/resume, debugging, interrupts) using Temporal primitives.

Deliverables

2.1 Interrupt System (Signals + Queries)

// Workflow can pause and wait for human input
func (w *AgentWorkflow) WaitForApproval(ctx workflow.Context, request ApprovalRequest) (ApprovalResponse, error)

// External systems interact via signals
SignalApprove(workflowID, runID, ApprovalResponse)
SignalReject(workflowID, runID, reason)
SignalEditState(workflowID, runID, StatePatch)

// Queries for UI
QueryCurrentState(workflowID) -> AgentState
QueryPendingApprovals(workflowID) -> []ApprovalRequest
QueryDecisionLog(workflowID) -> []DecisionRecord

Interrupt patterns:

  • Pre-tool approval: "Agent wants to call send_email. Approve?"
  • Mid-run checkpoint: "Agent has drafted a plan. Review before execution?"
  • Timeout escalation: "Agent stuck for 5 minutes. Intervene?"
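
The timeout semantics fall out of racing a signal channel against a timer. A minimal sketch of WaitForApproval, assuming a signal name, a Timeout field on ApprovalRequest, and a sentinel error that are all illustrative:

import "go.temporal.io/sdk/workflow"

func (w *AgentWorkflow) WaitForApproval(ctx workflow.Context, req ApprovalRequest) (ApprovalResponse, error) {
    var resp ApprovalResponse
    approved := false

    ch := workflow.GetSignalChannel(ctx, "approval")
    timer := workflow.NewTimer(ctx, req.Timeout)

    sel := workflow.NewSelector(ctx)
    sel.AddReceive(ch, func(c workflow.ReceiveChannel, more bool) {
        c.Receive(ctx, &resp)
        approved = true
    })
    sel.AddFuture(timer, func(f workflow.Future) {
        // timeout fired first: fall through and let the caller escalate
    })
    sel.Select(ctx)

    if !approved {
        return ApprovalResponse{}, ErrApprovalTimeout // hypothetical sentinel
    }
    return resp, nil
}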

2.2 Decision Log (First-Class Forensics)

type DecisionRecord struct {
    StepIndex int
    Timestamp time.Time

    // Inputs (deterministic hash)
    PromptHash     string
    StateHash      string
    ToolSchemaHash string
    ModelConfig    ModelConfig

    // Outputs
    Decision   Decision
    TokensUsed int
    Latency    time.Duration

    // Provenance
    ModelID      string
    ModelVersion string
}

Enables:

  • "Why did the agent do X?" → Find the decision, see exact inputs
  • "Would it do the same thing today?" → Replay with same hashes
  • Regression testing: golden decision logs as test fixtures

2.3 Replay and Time-Travel Debugging

  • Leverage Temporal's native workflow replay
  • Add: "replay from step N with modified state"
  • Add: "re-run decision at step N with current model" (A/B testing decisions)

2.4 Continue-As-New for Longevity

  • Trigger: state size > threshold OR step count > threshold
  • Seamless: external callers don't see the boundary
  • State compaction: summarize conversation history before rollover
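
In Temporal's Go SDK the rollover itself is a single call; the threshold check and compaction step sketched here are assumptions:

// maybeRollover runs at the top of each loop iteration; the thresholds
// and WithState helper are illustrative.
func maybeRollover(ctx workflow.Context, input AgentRunInput, state AgentState, step int) error {
    if state.SizeBytes() < maxStateBytes && step < maxStepsPerHistory {
        return nil // keep going in the current history
    }
    var compacted AgentState
    // Summarize conversation history in an activity before rollover.
    if err := workflow.ExecuteActivity(ctx, CompactStateActivity, state).Get(ctx, &compacted); err != nil {
        return err
    }
    // Callers return this error; Temporal starts a fresh history transparently.
    return workflow.NewContinueAsNewError(ctx, AgentRunWorkflow, input.WithState(compacted))
}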

2.5 Failure Semantics

type ToolRetryPolicy struct {
    MaxAttempts        int
    InitialBackoff     time.Duration
    MaxBackoff         time.Duration
    BackoffCoefficient float64
    NonRetryableErrors []string // e.g., "validation_error"
}

type CircuitBreakerConfig struct {
    FailureThreshold int
    RecoveryTimeout  time.Duration
}

  • Per-tool retry policies (HTTP tools retry, validation errors don't)
  • Circuit breakers for external APIs
  • Graceful degradation: "tool unavailable, skip or substitute"
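
Per-tool policies can map directly onto Temporal's activity retry options. A sketch of the translation, assuming the go.temporal.io/sdk/temporal package:

import (
    "time"

    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

// withToolOptions scopes activity options to a single tool call, so each
// tool carries its own retry behavior.
func withToolOptions(ctx workflow.Context, p ToolRetryPolicy, timeout time.Duration) workflow.Context {
    return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: timeout,
        RetryPolicy: &temporal.RetryPolicy{
            InitialInterval:        p.InitialBackoff,
            MaximumInterval:        p.MaxBackoff,
            BackoffCoefficient:     p.BackoffCoefficient,
            MaximumAttempts:        int32(p.MaxAttempts),
            NonRetryableErrorTypes: p.NonRetryableErrors,
        },
    })
}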

2.6 MCP Tool Bridge (External Tool Servers)

Goal: Allow MCP servers to be registered and invoked as tools within the runtime.

  • Adapter to map MCP tool metadata to ToolDefinition (schema + description)
  • MCP tool invocation path through ToolExecuteActivity
  • Versioned registry entries for MCP tool sets
  • Configurable MCP server endpoints
  • Test with local MCP servers (e.g., ../mcp-notion, ../mcp-todoist)

Exit Criteria

  • Agent pauses, waits for human signal, resumes correctly
  • Decision log captures all inputs with hashes
  • Can replay workflow from Temporal UI and get same result
  • Continue-as-new triggers cleanly on long runs
  • MCP tools can be registered and invoked as standard tools

Phase 3: Structured Execution Graphs

Goal: Support branching, parallelism, and conditional routing without exposing Temporal's selector complexity.

Deliverables

3.1 Step DSL

agent := NewAgent("research-agent").
    Step("plan", PlanStep).
    Step("research", ResearchStep,
        When(func(s State) bool { return s.NeedsResearch })).
    Parallel("gather",
        Branch("web", WebSearchStep),
        Branch("db", DatabaseStep),
    ).Join(AggregateResults).
    Step("synthesize", SynthesizeStep).
    Step("review", ReviewStep,
        Interrupt(RequireApproval("final_review"))).
    Build()

Compiles to: Temporal workflow with proper selector logic, join points, and signal waits.

3.2 Conditional Transitions

Step("classify", ClassifyStep).
Route(
    On("simple", DirectAnswerStep),
    On("complex", DeepResearchStep),
    On("dangerous", EscalateToHumanStep),
    Default(FallbackStep),
)

3.3 Parallel Execution with Join Semantics

| Join Type | Behavior |
| --- | --- |
| JoinAll | Wait for all branches, aggregate results |
| JoinAny | Return first successful result, cancel others |
| JoinN(n) | Wait for N successes, cancel rest |
| JoinVote | Wait for all, return majority decision |
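
A sketch of how the first two join types could be built over activity futures; Branch and Result are illustrative types and error handling is simplified:

import "go.temporal.io/sdk/workflow"

// joinAll starts every branch, then blocks on each future in order.
func joinAll(ctx workflow.Context, branches []Branch) ([]Result, error) {
    futures := make([]workflow.Future, len(branches))
    for i, b := range branches {
        futures[i] = workflow.ExecuteActivity(ctx, b.Activity, b.Input)
    }
    results := make([]Result, len(futures))
    for i, f := range futures {
        if err := f.Get(ctx, &results[i]); err != nil {
            return nil, err
        }
    }
    return results, nil
}

// joinAny returns the first branch to complete and cancels the rest.
// A production version would keep selecting until a success arrives.
func joinAny(ctx workflow.Context, branches []Branch) (Result, error) {
    childCtx, cancel := workflow.WithCancel(ctx)
    defer cancel() // best-effort cancellation of the losing branches

    sel := workflow.NewSelector(childCtx)
    var winner Result
    var winErr error
    for _, b := range branches {
        f := workflow.ExecuteActivity(childCtx, b.Activity, b.Input)
        sel.AddFuture(f, func(f workflow.Future) { winErr = f.Get(childCtx, &winner) })
    }
    sel.Select(childCtx) // unblocks on the first completion
    return winner, winErr
}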

3.4 Subgraph Composition

// Define reusable subgraphs
ResearchSubgraph := Subgraph("research").
Step("search", SearchStep).
Step("filter", FilterStep).
Step("summarize", SummarizeStep).
Build()

// Compose into larger workflows
MainAgent := NewAgent("main").
Step("plan", PlanStep).
Include(ResearchSubgraph). // inline expansion
Step("respond", RespondStep).
Build()

Exit Criteria

  • DSL compiles to valid Temporal workflows
  • Parallel branches execute concurrently
  • Join semantics work correctly (tested: all, any, vote)
  • Subgraphs compose without namespace collisions

Phase 4: Memory Architecture

Goal: Structured memory that survives workflow boundaries and scales.

Deliverables

4.1 Memory Taxonomy

| Memory Type | Scope | Storage | Example |
| --- | --- | --- | --- |
| Working | Current step | Workflow state | "User just said X" |
| Episodic | Current run | Workflow state + external | "Earlier in this conversation..." |
| Semantic | Cross-run | External store | "User prefers formal tone" |
| Procedural | Global | External store | "How to search the web" |

4.2 Memory Store Interface

type MemoryStore interface {
    // Episodic
    SaveEpisode(ctx context.Context, runID string, episode Episode) error
    GetEpisodes(ctx context.Context, runID string, filter EpisodeFilter) ([]Episode, error)

    // Semantic (vector-backed)
    SaveFact(ctx context.Context, namespace string, fact Fact, embedding []float32) error
    QueryFacts(ctx context.Context, namespace string, query []float32, k int) ([]Fact, error)

    // Procedural
    SaveProcedure(ctx context.Context, name string, procedure Procedure) error
    GetProcedure(ctx context.Context, name string) (Procedure, error)
}

Implementations:

  • PostgresMemoryStore (recommended for most cases)
  • RedisMemoryStore (high-throughput, TTL-based episodic)
  • PineconeMemoryStore (semantic search at scale)
  • InMemoryStore (testing)

4.3 Automatic Memory Management

type MemoryPolicy struct {
    WorkingMemoryLimit int  // max items in working memory
    EpisodicSummarize  int  // summarize after N episodes
    SemanticExtract    bool // auto-extract facts to semantic store
    ConversationWindow int  // sliding window for context
}

  • Automatic summarization when episodic memory grows
  • Fact extraction: identify and persist durable knowledge
  • Forgetting: TTL-based expiration, relevance-based pruning
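
A sketch of how the policy could be enforced after each step; SummarizeEpisodesActivity and the state fields are hypothetical names:

// enforceMemoryPolicy runs inside the workflow after each step.
func enforceMemoryPolicy(ctx workflow.Context, p MemoryPolicy, state *AgentState) error {
    // Bound working memory with a sliding window.
    if len(state.Working) > p.WorkingMemoryLimit {
        state.Working = state.Working[len(state.Working)-p.WorkingMemoryLimit:]
    }
    // Summarize episodic memory once it grows past the threshold.
    if len(state.Episodes) > p.EpisodicSummarize {
        var summary Episode
        if err := workflow.ExecuteActivity(ctx, SummarizeEpisodesActivity, state.Episodes).Get(ctx, &summary); err != nil {
            return err
        }
        state.Episodes = []Episode{summary}
    }
    return nil
}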

4.4 Cross-Run Persistence

// Start new run with memory from previous runs
input := AgentRunInput{
    RunID: "run-456",
    MemoryRefs: []MemoryRef{
        {Type: "episodic", RunID: "run-123"}, // continue conversation
        {Type: "semantic", Namespace: "user-prefs"},
    },
}

Exit Criteria

  • Working memory bounded and performant
  • Episodic memory persists across continue-as-new
  • Semantic search returns relevant facts
  • Memory policies auto-manage growth

Phase 5: Multi-Agent Orchestration

Goal: First-class support for agent coordination patterns.

Deliverables

5.1 Supervisor Pattern

supervisor := NewSupervisor("project-manager").
    Agent("researcher", ResearcherAgent).
    Agent("writer", WriterAgent).
    Agent("critic", CriticAgent).
    Coordinate(func(ctx SupervisorContext) Output {
        for {
            research := ctx.Delegate("researcher", task)
            draft := ctx.Delegate("writer", research.Output)
            review := ctx.Delegate("critic", draft.Output)
            if review.Approved {
                return draft.Output
            }
            // iterate with the critic's feedback...
        }
    }).
    Build()

5.2 Coordination Patterns

| Pattern | Description | Use Case |
| --- | --- | --- |
| Delegate | Supervisor assigns task, waits for result | Task decomposition |
| Broadcast | Send to all agents, collect responses | Brainstorming |
| Debate | Agents argue positions, supervisor decides | Decision-making |
| Pipeline | Chain agents sequentially | Document processing |
| MapReduce | Parallel processing, aggregated results | Large-scale analysis |
| Voting | Agents vote, majority wins | Consensus |

5.3 Inter-Agent Communication

// Direct messaging (via Temporal signals)
ctx.Send("critic", Message{Type: "feedback", Content: feedback})

// Shared blackboard (external state)
ctx.Blackboard.Write("current_draft", draft)
draft := ctx.Blackboard.Read("current_draft")

// Event bus (pub/sub via Temporal)
ctx.Publish("draft_updated", DraftEvent{...})
ctx.Subscribe("draft_updated", handler)

5.4 Saga Pattern for Multi-Agent Transactions

saga := NewSaga("order-processing").
    Step("validate", ValidateAgent, CompensateValidation).
    Step("reserve", InventoryAgent, CompensateReservation).
    Step("charge", PaymentAgent, CompensateCharge).
    Step("fulfill", FulfillmentAgent, CompensateFulfillment).
    Build()

  • Automatic compensation on failure
  • Distributed transaction semantics across agents
  • Partial completion visibility
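
A sketch of the compensation machinery the builder would generate: completed steps push their compensations onto a stack, which unwinds in reverse on failure. Everything besides the Temporal SDK calls is an assumption:

import "go.temporal.io/sdk/workflow"

// SagaStep pairs a forward activity with its compensation (illustrative).
type SagaStep struct {
    Activity   any
    Compensate any
    Input      any
}

// runSaga executes steps in order; on failure it unwinds completed steps
// in reverse, running each compensation.
func runSaga(ctx workflow.Context, steps []SagaStep) error {
    var done []SagaStep
    for _, s := range steps {
        if err := workflow.ExecuteActivity(ctx, s.Activity, s.Input).Get(ctx, nil); err != nil {
            for i := len(done) - 1; i >= 0; i-- {
                if cerr := workflow.ExecuteActivity(ctx, done[i].Compensate, done[i].Input).Get(ctx, nil); cerr != nil {
                    workflow.GetLogger(ctx).Error("compensation failed", "step", i, "error", cerr)
                }
            }
            return err
        }
        done = append(done, s)
    }
    return nil
}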

Exit Criteria

  • Supervisor can coordinate 3+ agents
  • Debate pattern produces reasoned output
  • Saga compensates correctly on mid-process failure
  • Blackboard state consistent across agents

Phase 6: Differentiating Features

Goal: Capabilities that don't exist in LangGraph or any current framework.

6.1 Causal Tracing

What: Track exactly which inputs influenced which outputs across the entire agent execution.

type CausalTrace struct {
    OutputID   string
    Influences []Influence
}

type Influence struct {
    InputID    string   // which input
    InputType  string   // "user_message", "tool_result", "memory_fact"
    Pathway    []string // steps it passed through
    Confidence float64  // attribution strength
}

// Query: "Why did the agent recommend product X?"
trace := GetCausalTrace(runID, outputID)
// Returns: user mentioned "budget", tool returned price data, memory had preference

Enables:

  • Explainability: "The agent recommended X because..."
  • Debugging: "This output was wrong because this tool returned bad data"
  • Compliance: Audit trail for regulated industries

6.2 Counterfactual Replay

What: "What would have happened if...?" simulation.

// Original run made decision A at step 3
// What if it had made decision B?
counterfactual := ReplayWithOverride(runID, Override{
    StepIndex: 3,
    Decision:  DecisionB,
})

// Returns: full execution trace of alternate timeline

Enables:

  • Decision analysis: compare outcomes of different choices
  • Training data: generate (decision, outcome) pairs
  • Root cause: "If the agent had done X instead, would it have succeeded?"

6.3 Adaptive Retry with Learning

What: Retry policies that learn from failure patterns.

type AdaptiveRetryPolicy struct {
    BasePolicy      RetryPolicy
    LearningEnabled bool

    // Learned adjustments
    SuccessPatterns []Pattern // conditions that predict success
    FailurePatterns []Pattern // conditions that predict failure
}

// System learns: "HTTP calls to api.example.com fail 80% between 2-3pm UTC"
// Automatically: delays retries, switches to backup, or skips

6.4 Cost-Aware Execution

What: Track and optimize for cost, not just correctness.

type CostPolicy struct {
    MaxCostPerRun  float64
    MaxCostPerStep float64
    PreferCheaper  bool // choose cheaper model when confidence is high
    CostBreakdown  bool // detailed cost attribution
}

type StepCost struct {
    StepIndex int
    TokenCost float64
    ToolCost  float64 // API calls, compute, etc.
    TimeCost  time.Duration
    TotalCost float64
}

// Query: "Which step is most expensive?"
// Action: Auto-switch to cheaper model for low-stakes decisions

6.5 Semantic Versioning for Agent Behavior

What: Version agents like software, with semantic meaning.

type AgentVersion struct {
    Version   string // semver
    Changelog string

    // Breaking change detection
    SchemaChanges   []SchemaChange   // tool interface changes
    BehaviorChanges []BehaviorChange // detected via test suite

    // Compatibility
    Compatible    []string          // versions this can replace
    MigrationPath func(State) State // state migration if needed
}

// Deploy with confidence: "v2.1.0 is backward-compatible with v2.0.x"
// Rollback safely: "v2.2.0 broke X, rolling back to v2.1.0"

6.6 Differential Testing

What: Automatically compare agent versions.

// Run same inputs through two agent versions
diff := DifferentialTest(
    AgentV1, AgentV2,
    TestSuite{
        Inputs:  []Input{...},
        Metrics: []Metric{Correctness, Cost, Latency},
    },
)

// Output: "V2 is 15% more accurate, 20% cheaper, but 10% slower"
// Breakdown by input category, failure modes, edge cases

6.7 Live Agent Inspection

What: Attach to running agents and inspect/modify state.

// From CLI or UI
inspect --workflow-id agent-run-123

> state # print current state
> state.memory # drill into memory
> set state.goal "new goal" # modify state (triggers re-evaluation)
> step # execute one step, pause
> continue # resume normal execution
> inject tool_result {...} # inject synthetic tool result

Enables:

  • Debugging stuck agents in production
  • Manual intervention without restart
  • Training/demo scenarios with controlled inputs

6.8 Speculative Execution

What: Run multiple decision paths in parallel, commit the best one.

type SpeculativeConfig struct {
    Enabled     bool
    MaxBranches int           // how many paths to explore
    SelectionFn SelectionFunc // how to pick the winner
    CostLimit   float64       // max extra cost for speculation
}

// Agent uncertain between 3 approaches
// System runs all 3 in parallel (as child workflows)
// Commits the one that succeeds first / scores highest / costs least
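
A sketch of the "commit first to succeed" variant, racing child workflows through a selector; PlanWorkflow, Plan, and Result are illustrative names:

// speculate runs candidate plans as child workflows and commits the
// first to finish; the losers are canceled via the shared cancel scope.
func speculate(ctx workflow.Context, plans []Plan) (Result, error) {
    childCtx, cancel := workflow.WithCancel(ctx)
    defer cancel() // reaps the branches that lost the race

    sel := workflow.NewSelector(childCtx)
    var winner Result
    var winErr error
    for _, p := range plans {
        f := workflow.ExecuteChildWorkflow(childCtx, PlanWorkflow, p)
        sel.AddFuture(f, func(f workflow.Future) { winErr = f.Get(childCtx, &winner) })
    }
    sel.Select(childCtx) // first completion wins
    return winner, winErr
}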

Enables:

  • Faster resolution of uncertain decisions
  • Automatic A/B testing of strategies
  • Graceful handling of "I'm not sure, let me try both"

Phase 7A: Developer Experience — CLI + Scaffolding

Goal: Fast path to a working agent project and local iteration.

Deliverables

  • CLI project scaffolding:
    • temporal-agent init my-agent
    • temporal-agent dev (local run + hot reload)
    • temporal-agent deploy --cluster prod
  • Environment bootstrap (compose, env templates, registry seed)
  • Local workflow inspection shortcuts (list runs, latest run, cost summary)

Exit Criteria

  • New project created in < 2 minutes
  • temporal-agent dev runs a local worker plus a sample run end-to-end

Phase 7B: Developer Experience — Testing & Replay Harness

Goal: Deterministic tests and safe replay tooling.

Deliverables

  • Deterministic LLM mocks + tool mocks
  • Golden input tests for decision logs
  • Replay helpers (from step N, override decision/state)
func TestResearchAgent(t *testing.T) {
    suite := agenttest.NewSuite(t)

    // Deterministic LLM mock
    suite.MockLLM(agenttest.Deterministic(map[string]string{
        "hash-of-prompt-1": "expected-response-1",
    }))

    // Tool mocks
    suite.MockTool("http_get", func(args Args) Result {
        return Result{Body: "mock response"}
    })

    // Run and assert
    result := suite.Run(ResearchAgent, Input{Prompt: "test"})

    assert.Equal(t, "expected output", result.Output)
    assert.Equal(t, 3, result.StepCount)
    assert.NoError(t, result.Error)
}

Exit Criteria

  • Tests run deterministically with mocks
  • Replay from step N reproducible in CI

Phase 7C: Developer Experience — Web UI + VS Code Extension

Goal: First-class operator UI and editor integration.

Deliverables

Web UI

  • Run list view (status, duration, cost)
  • Run detail view (timeline, state diffs, tool calls)
  • Interrupt queue (approvals/rejections)
  • Replay console (compare original vs replay)

VS Code Extension

  • DSL syntax highlighting
  • Inline tool schema validation
  • Run agent + output panel
  • Breakpoints and step debugging
  • Temporal workflow visualization

Exit Criteria

  • UI shows all run details with drill-down
  • Extension provides meaningful DX improvement

Phase 7D: Provider Parity — Azure OpenAI

Goal: Azure OpenAI support with the same model catalog + cost pipeline as OpenRouter.

Deliverables

  • Azure OpenAI client (chat + tools) with config-based provider selection
  • Model catalog ingestion (list models + context limits + pricing)
  • Cost accounting with Azure usage metrics
  • Per-run/provider selection (OpenRouter or Azure) and fallback rules
  • Tooling parity: models-refresh, models-list, cost query

Exit Criteria

  • Azure and OpenRouter interchangeable via config
  • Cost tracking matches provider usage

Phase 7E: Sandboxed Python Tool

Goal: Provide a safe, sandboxed Python execution tool that agents can call as part of workflows.

Deliverables

  • python_exec tool with strict resource limits and no network access
  • Output capture (stdout/stderr/exit code)
  • Tool schema with timeouts and input payloads
  • Tests covering timeouts and sandbox constraints

Exit Criteria

  • Tool executes simple scripts safely
  • Timeouts enforced
  • Network and filesystem write access blocked

Phase 8: Integrations & Ecosystem

Goal: Batteries included for common use cases.

8.1 LLM Providers

| Provider | Status | Features |
| --- | --- | --- |
| OpenAI | Priority | GPT-4, function calling, vision |
| Anthropic | Priority | Claude, tool use |
| Google | Phase 2 | Gemini |
| AWS Bedrock | Phase 2 | Multi-model |
| Local (Ollama) | Phase 2 | Self-hosted |
| Azure OpenAI | Phase 7D | Enterprise |

8.2 Tool Kits

| Kit | Tools |
| --- | --- |
| Web | HTTP GET/POST, scrape, screenshot |
| Search | Google, Bing, DuckDuckGo, Tavily |
| Database | SQL query, vector search |
| Files | Read, write, parse (PDF, DOCX, etc.) |
| Code | Execute Python/JS, sandbox |
| Communication | Email, Slack, SMS |
| Cloud | AWS, GCP, Azure SDKs |

8.3 Vector Stores

  • Pinecone
  • Weaviate
  • Qdrant
  • pgvector
  • Chroma (local)

8.4 Observability

  • OpenTelemetry (native)
  • Datadog
  • Honeycomb
  • Grafana stack

8.5 Auth & Secrets

  • Vault integration
  • AWS Secrets Manager
  • Environment injection
  • OAuth flows for tools

Phase 9: Enterprise Hardening

Goal: Production-ready for serious workloads.

Deliverables

9.1 Multi-Tenancy

type TenantConfig struct {
    TenantID       string
    Namespace      string // Temporal namespace isolation
    ResourceQuotas ResourceQuotas
    RateLimits     RateLimits
    AllowedModels  []string
    AllowedTools   []string
}

9.2 Rate Limiting & Quotas

  • Per-tenant rate limits
  • Per-model token quotas
  • Per-tool call limits
  • Graceful degradation on quota exhaustion

9.3 Audit Logging

type AuditEvent struct {
    Timestamp time.Time
    TenantID  string
    RunID     string
    EventType string // "decision", "tool_call", "state_change", "approval"
    Actor     string // "agent", "user:123", "system"
    Details   any
    Hash      string // tamper-evident
}

  • Immutable audit log
  • Compliance exports (SOC2, HIPAA formats)
  • PII detection and redaction
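
A sketch of the tamper-evidence mechanism, assuming a simple hash chain: each event's hash covers its serialized body plus the previous event's hash, so editing any entry invalidates every later one:

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
)

// chainHash links event N to event N-1; verification replays the chain
// from the first entry and compares stored hashes.
func chainHash(prevHash string, event AuditEvent) (string, error) {
    event.Hash = "" // the hash covers the body, not itself
    body, err := json.Marshal(event)
    if err != nil {
        return "", err
    }
    h := sha256.New()
    h.Write([]byte(prevHash))
    h.Write(body)
    return hex.EncodeToString(h.Sum(nil)), nil
}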

9.4 Security

  • Tool sandboxing (gVisor, Firecracker)
  • Input sanitization
  • Output filtering
  • Network policies for tools

9.5 High Availability

  • Multi-region Temporal deployment guide
  • State replication strategies
  • Disaster recovery playbook

Exit Criteria

  • Tenants isolated at namespace level
  • Rate limits enforced without race conditions
  • Audit log tamper-evident
  • Security review passed

Timeline (Estimated)

| Phase | Duration | Cumulative |
| --- | --- | --- |
| Phase 1: MVP | 4-6 weeks | 6 weeks |
| Phase 2: Checkpoints/HITL | 4-6 weeks | 12 weeks |
| Phase 3: Graphs | 3-4 weeks | 16 weeks |
| Phase 4: Memory | 4-6 weeks | 22 weeks |
| Phase 5: Multi-Agent | 4-6 weeks | 28 weeks |
| Phase 6: Differentiators | 6-8 weeks | 36 weeks |
| Phase 7A: DevEx CLI | 2-3 weeks | 38-39 weeks |
| Phase 7B: DevEx Testing | 2-3 weeks | 40-42 weeks |
| Phase 7C: DevEx UI/Extension | 4-6 weeks | 44-48 weeks |
| Phase 7D: Azure Provider Parity | 2-3 weeks | 46-51 weeks |
| Phase 8: Integrations | Ongoing | - |
| Phase 9: Enterprise | 6-8 weeks | 50 weeks |

Note: Phases 7-8 can run in parallel with 5-6. Enterprise hardening (Phase 9) can begin after Phase 2 if there's early enterprise interest.


Success Metrics

Adoption

  • GitHub stars (a vanity metric, but it drives visibility)
  • Production deployments (real metric)
  • Community contributions

Technical

  • Agent survival rate (% completing without crash)
  • Mean time to recovery (on failure)
  • P99 step latency

Developer Experience

  • Time to first working agent
  • Test coverage achievable
  • Debug time for failures

Differentiation

  • Features not available elsewhere
  • Performance vs alternatives
  • Operational clarity vs alternatives

Open Questions

  1. DSL syntax: Go-native (method chaining) vs external config (YAML/JSON)?
  2. State serialization: Protobuf vs JSON vs custom?
  3. Memory store default: Postgres (safe) vs Redis (fast)?
  4. UI: Build custom vs extend Temporal UI?
  5. Licensing: Apache 2.0 vs MIT vs BSL?

Appendix: Why Not Just Use LangGraph?

LangGraph is excellent for prototyping. It falls short for production:

ConcernLangGraphThis Runtime
DurabilityLibrary checkpointers (Postgres, Redis)Temporal cluster (distributed, battle-tested)
Failure handlingManual retry logicInfrastructure-level retries, timeouts, circuit breakers
Multi-agentGraph compositionFirst-class supervision, sagas, compensation
DebuggingThread historyFull replay, counterfactual analysis, causal tracing
Operational controlBasicSignals, queries, live inspection, hot fixes
ScaleSingle processDistributed workers, horizontal scale
Long-runningMemory issuesContinue-as-new, bounded histories

We're not building "LangGraph but different." We're building a production-grade agent runtime that happens to match (and exceed) LangGraph's capabilities.


Last updated: Phase 0 (Planning)