Temporal Agent Runtime — Roadmap
A production-grade, Temporal-native agent runtime that treats durability, observability, and operational control as first-class concerns—not afterthoughts.
Vision
Most agent frameworks optimize for demo speed. We optimize for production survival.
LangGraph bolted durability onto a graph abstraction. We invert this: Temporal's battle-tested workflow engine is the foundation, and the agent abstraction compiles down to it. The result is an agent runtime where retry logic, failure recovery, human-in-the-loop, and multi-agent coordination aren't library features—they're infrastructure guarantees.
Differentiators (Why This Exists)
| Capability | LangGraph | This Runtime |
|---|---|---|
| Durability | Library-level checkpointers | Infrastructure-grade (Temporal cluster) |
| Failure recovery | Manual retry config | Automatic, policy-driven, battle-tested |
| Human-in-the-loop | Interrupt API | Native signals + queries with timeout semantics |
| Replay/debugging | Thread history | Full workflow replay with deterministic re-execution |
| Multi-agent | Graph composition | First-class supervisor workflows, sagas, compensation |
| Observability | Logging hooks | Native OpenTelemetry, distributed tracing across agents |
| Time-travel debugging | Checkpoint inspection | Replay any workflow from any point in history |
| Long-running agents | Memory management | Continue-as-new, external state, bounded histories |
Phase 1: Durable Single-Agent Loop (MVP)
Goal: A minimal agent that survives crashes, retries failures, and runs on Temporal.
Deliverables
1.1 Core Workflow: AgentRunWorkflow
```go
type AgentRunInput struct {
    RunID           string
    InitialPrompt   string
    ToolRegistryRef ToolRegistryRef // immutable version reference
    Config          AgentConfig
}

type AgentConfig struct {
    Model        string
    MaxSteps     int
    StepTimeout  time.Duration
    TotalTimeout time.Duration
}
```
- Deterministic state machine loop: decide → act → observe → repeat
- Clean stop conditions: max steps, goal achieved, timeout, explicit halt
- All non-determinism isolated to activities
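Stripped of the Temporal SDK, the loop above can be sketched in plain Go. This is a minimal illustration, not the runtime's actual implementation: `Decision`, `Decider`, and `Executor` are hypothetical names standing in for `LLMDecideActivity` and `ToolExecuteActivity`, and all non-determinism is isolated behind the two function values, mirroring the activity boundary.

```go
package main

import (
	"errors"
	"fmt"
)

// Decision is what the (mocked) LLM returns each cycle.
type Decision struct {
	Halt bool   // agent decided the goal is achieved
	Tool string // tool to invoke when not halting
	Args string
}

// Decider and Executor stand in for LLMDecideActivity / ToolExecuteActivity.
type Decider func(state []string) (Decision, error)
type Executor func(tool, args string) (string, error)

// RunLoop is the deterministic decide → act → observe loop with a
// max-steps stop condition.
func RunLoop(maxSteps int, decide Decider, exec Executor) ([]string, error) {
	var state []string
	for step := 0; step < maxSteps; step++ {
		d, err := decide(state) // decide
		if err != nil {
			return state, err
		}
		if d.Halt {
			return state, nil // clean stop: goal achieved
		}
		obs, err := exec(d.Tool, d.Args) // act
		if err != nil {
			return state, err
		}
		state = append(state, obs) // observe
	}
	return state, errors.New("max steps exceeded")
}

func main() {
	// A decider that calls a tool twice, then halts.
	n := 0
	decide := func(state []string) (Decision, error) {
		if n >= 2 {
			return Decision{Halt: true}, nil
		}
		n++
		return Decision{Tool: "echo", Args: fmt.Sprintf("step-%d", n)}, nil
	}
	exec := func(tool, args string) (string, error) { return args, nil }
	state, err := RunLoop(10, decide, exec)
	fmt.Println(state, err) // [step-1 step-2] <nil>
}
```

In the real workflow the loop body stays deterministic and the two function values become Temporal activity invocations.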
1.2 Activities
| Activity | Purpose |
|---|---|
| `LLMDecideActivity` | Call LLM with current state, return structured decision |
| `ToolExecuteActivity` | Execute a tool with validated arguments |
| `ObserveActivity` | Process tool results, update observations |
1.3 Tool Registry (Immutable + Versioned from Day 1)
```go
type ToolRegistry struct {
    ID        string
    Version   string // semver, immutable once published
    Tools     []ToolDefinition
    CreatedAt time.Time
    Hash      string // content-addressable
}

type ToolDefinition struct {
    Name        string
    Description string
    Schema      JSONSchema   // strict validation
    Executor    ToolExecutor // activity implementation
    RetryPolicy *RetryPolicy // per-tool failure handling
}
```
Why versioned from Phase 1: Debugging agent behavior requires knowing exactly what tools it had. Mutable registries destroy forensics.
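One way the content-addressable `Hash` field might be derived: hash a canonical serialization of the tool list, so two registries with identical tools always share a hash. A minimal sketch, assuming a trimmed `ToolDefinition` with only behavior-affecting fields:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// ToolDefinition is trimmed here to the fields that affect agent behavior.
type ToolDefinition struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	Schema      string `json:"schema"`
}

// RegistryHash derives a content-addressable hash: sha256 over a canonical
// JSON encoding of the tool list. encoding/json serializes struct fields in
// declaration order, so identical tool sets produce identical hashes.
func RegistryHash(tools []ToolDefinition) (string, error) {
	b, err := json.Marshal(tools)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	tools := []ToolDefinition{
		{Name: "http_get", Description: "Fetch a URL", Schema: `{"type":"object"}`},
	}
	h, _ := RegistryHash(tools)
	fmt.Println(h[:12]) // stable prefix for the same tool set
}
```

The same technique applies to the `PromptHash`/`StateHash`/`ToolSchemaHash` fields in Phase 2's decision log.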
1.4 Observability (Production-Shaped from Start)
- Workflow-level: run ID, steps completed, current state hash
- Activity-level: latency, token usage, retry count
- OpenTelemetry spans: one span per decision cycle, child spans for tool calls
- Structured logs with correlation IDs
1.5 Example Project
- 3 tools: `http_get`, `calculator`, `key_value_store`
- One agent: "Research assistant that can fetch URLs, compute, and remember facts"
- Demonstrates: retry on HTTP failure, timeout handling, clean shutdown
Exit Criteria
- Agent survives worker restart mid-run
- Failed tool calls retry with backoff
- Full trace visible in Jaeger/Tempo
- Tool schema validation rejects malformed calls
Phase 2: Checkpoints, Replay, and Human-in-the-Loop
Goal: Match LangGraph's operational UX (pause/resume, debugging, interrupts) using Temporal primitives.
Deliverables
2.1 Interrupt System (Signals + Queries)
```go
// Workflow can pause and wait for human input
func (w *AgentWorkflow) WaitForApproval(ctx workflow.Context, request ApprovalRequest) (ApprovalResponse, error)

// External systems interact via signals
SignalApprove(workflowID, runID, ApprovalResponse)
SignalReject(workflowID, runID, reason)
SignalEditState(workflowID, runID, StatePatch)

// Queries for UI
QueryCurrentState(workflowID) -> AgentState
QueryPendingApprovals(workflowID) -> []ApprovalRequest
QueryDecisionLog(workflowID) -> []DecisionRecord
```
Interrupt patterns:
- Pre-tool approval: "Agent wants to call `send_email`. Approve?"
- Mid-run checkpoint: "Agent has drafted a plan. Review before execution?"
- Timeout escalation: "Agent stuck for 5 minutes. Intervene?"
2.2 Decision Log (First-Class Forensics)
```go
type DecisionRecord struct {
    StepIndex int
    Timestamp time.Time

    // Inputs (deterministic hash)
    PromptHash     string
    StateHash      string
    ToolSchemaHash string
    ModelConfig    ModelConfig

    // Outputs
    Decision   Decision
    TokensUsed int
    Latency    time.Duration

    // Provenance
    ModelID      string
    ModelVersion string
}
```
Enables:
- "Why did the agent do X?" → Find the decision, see exact inputs
- "Would it do the same thing today?" → Replay with same hashes
- Regression testing: golden decision logs as test fixtures
2.3 Replay and Time-Travel Debugging
- Leverage Temporal's native workflow replay
- Add: "replay from step N with modified state"
- Add: "re-run decision at step N with current model" (A/B testing decisions)
2.4 Continue-As-New for Longevity
- Trigger: state size > threshold OR step count > threshold
- Seamless: external callers don't see the boundary
- State compaction: summarize conversation history before rollover
2.5 Failure Semantics
```go
type ToolRetryPolicy struct {
    MaxAttempts        int
    InitialBackoff     time.Duration
    MaxBackoff         time.Duration
    BackoffCoefficient float64
    NonRetryableErrors []string // e.g., "validation_error"
}

type CircuitBreakerConfig struct {
    FailureThreshold int
    RecoveryTimeout  time.Duration
}
```
- Per-tool retry policies (HTTP tools retry, validation errors don't)
- Circuit breakers for external APIs
- Graceful degradation: "tool unavailable, skip or substitute"
2.6 MCP Tool Bridge (External Tool Servers)
Goal: Allow MCP servers to be registered and invoked as tools within the runtime.
- Adapter to map MCP tool metadata to `ToolDefinition` (schema + description)
- MCP tool invocation path through `ToolExecuteActivity`
- Versioned registry entries for MCP tool sets
- Configurable MCP server endpoints
- Test with local MCP servers (e.g., `../mcp-notion`, `../mcp-todoist`)
Exit Criteria
- Agent pauses, waits for human signal, resumes correctly
- Decision log captures all inputs with hashes
- Can replay workflow from Temporal UI and get same result
- Continue-as-new triggers cleanly on long runs
- MCP tools can be registered and invoked as standard tools
Phase 3: Structured Execution Graphs
Goal: Support branching, parallelism, and conditional routing without exposing Temporal's selector complexity.
Deliverables
3.1 Step DSL
```go
agent := NewAgent("research-agent").
    Step("plan", PlanStep).
    Step("research", ResearchStep,
        When(func(s State) bool { return s.NeedsResearch })).
    Parallel("gather",
        Branch("web", WebSearchStep),
        Branch("db", DatabaseStep),
    ).Join(AggregateResults).
    Step("synthesize", SynthesizeStep).
    Step("review", ReviewStep,
        Interrupt(RequireApproval("final_review"))).
    Build()
```
Compiles to: Temporal workflow with proper selector logic, join points, and signal waits.
3.2 Conditional Transitions
```go
Step("classify", ClassifyStep).
Route(
    On("simple", DirectAnswerStep),
    On("complex", DeepResearchStep),
    On("dangerous", EscalateToHumanStep),
    Default(FallbackStep),
)
```
3.3 Parallel Execution with Join Semantics
| Join Type | Behavior |
|---|---|
| `JoinAll` | Wait for all branches, aggregate results |
| `JoinAny` | Return first successful result, cancel others |
| `JoinN(n)` | Wait for N successes, cancel rest |
| `JoinVote` | Wait for all, return majority decision |
3.4 Subgraph Composition
```go
// Define reusable subgraphs
ResearchSubgraph := Subgraph("research").
    Step("search", SearchStep).
    Step("filter", FilterStep).
    Step("summarize", SummarizeStep).
    Build()

// Compose into larger workflows
MainAgent := NewAgent("main").
    Step("plan", PlanStep).
    Include(ResearchSubgraph). // inline expansion
    Step("respond", RespondStep).
    Build()
```
Exit Criteria
- DSL compiles to valid Temporal workflows
- Parallel branches execute concurrently
- Join semantics work correctly (tested: all, any, vote)
- Subgraphs compose without namespace collisions
Phase 4: Memory Architecture
Goal: Structured memory that survives workflow boundaries and scales.
Deliverables
4.1 Memory Taxonomy
| Memory Type | Scope | Storage | Example |
|---|---|---|---|
| Working | Current step | Workflow state | "User just said X" |
| Episodic | Current run | Workflow state + external | "Earlier in this conversation..." |
| Semantic | Cross-run | External store | "User prefers formal tone" |
| Procedural | Global | External store | "How to search the web" |
4.2 Memory Store Interface
```go
type MemoryStore interface {
    // Episodic
    SaveEpisode(ctx context.Context, runID string, episode Episode) error
    GetEpisodes(ctx context.Context, runID string, filter EpisodeFilter) ([]Episode, error)

    // Semantic (vector-backed)
    SaveFact(ctx context.Context, namespace string, fact Fact, embedding []float32) error
    QueryFacts(ctx context.Context, namespace string, query []float32, k int) ([]Fact, error)

    // Procedural
    SaveProcedure(ctx context.Context, name string, procedure Procedure) error
    GetProcedure(ctx context.Context, name string) (Procedure, error)
}
```
Implementations:
- `PostgresMemoryStore` (recommended for most cases)
- `RedisMemoryStore` (high-throughput, TTL-based episodic)
- `PineconeMemoryStore` (semantic search at scale)
- `InMemoryStore` (testing)
4.3 Automatic Memory Management
```go
type MemoryPolicy struct {
    WorkingMemoryLimit int  // max items in working memory
    EpisodicSummarize  int  // summarize after N episodes
    SemanticExtract    bool // auto-extract facts to semantic store
    ConversationWindow int  // sliding window for context
}
```
- Automatic summarization when episodic memory grows
- Fact extraction: identify and persist durable knowledge
- Forgetting: TTL-based expiration, relevance-based pruning
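The `WorkingMemoryLimit` + summarization interaction can be sketched as a bounded sliding window that compacts evicted items into a summary instead of silently dropping them. `WorkingMemory` and its injected `Summarize` function (an LLM call in practice) are illustrative names for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkingMemory enforces a bounded sliding window; items pushed out of the
// window are folded into a running summary rather than discarded.
type WorkingMemory struct {
	Limit     int
	Summary   string
	Items     []string
	Summarize func(summary string, evicted []string) string
}

func (m *WorkingMemory) Add(item string) {
	m.Items = append(m.Items, item)
	if len(m.Items) > m.Limit {
		evicted := m.Items[:len(m.Items)-m.Limit]
		m.Items = append([]string(nil), m.Items[len(m.Items)-m.Limit:]...)
		m.Summary = m.Summarize(m.Summary, evicted)
	}
}

func main() {
	m := &WorkingMemory{
		Limit: 2,
		// Stand-in for an LLM summarization call.
		Summarize: func(summary string, evicted []string) string {
			return strings.TrimSpace(summary + " " + strings.Join(evicted, " "))
		},
	}
	for _, item := range []string{"a", "b", "c", "d"} {
		m.Add(item)
	}
	fmt.Println(m.Summary, m.Items) // a b [c d]
}
```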
4.4 Cross-Run Persistence
```go
// Start new run with memory from previous runs
input := AgentRunInput{
    RunID: "run-456",
    MemoryRefs: []MemoryRef{
        {Type: "episodic", RunID: "run-123"}, // continue conversation
        {Type: "semantic", Namespace: "user-prefs"},
    },
}
```
Exit Criteria
- Working memory bounded and performant
- Episodic memory persists across continue-as-new
- Semantic search returns relevant facts
- Memory policies auto-manage growth
Phase 5: Multi-Agent Orchestration
Goal: First-class support for agent coordination patterns.
Deliverables
5.1 Supervisor Pattern
```go
supervisor := NewSupervisor("project-manager").
    Agent("researcher", ResearcherAgent).
    Agent("writer", WriterAgent).
    Agent("critic", CriticAgent).
    Coordinate(func(ctx SupervisorContext) Output {
        research := ctx.Delegate("researcher", task)
        draft := ctx.Delegate("writer", research.Output)
        review := ctx.Delegate("critic", draft.Output)
        if review.Approved {
            return draft.Output
        }
        // iterate...
    }).
    Build()
```
5.2 Coordination Patterns
| Pattern | Description | Use Case |
|---|---|---|
| Delegate | Supervisor assigns task, waits for result | Task decomposition |
| Broadcast | Send to all agents, collect responses | Brainstorming |
| Debate | Agents argue positions, supervisor decides | Decision-making |
| Pipeline | Chain agents sequentially | Document processing |
| MapReduce | Parallel processing, aggregated results | Large-scale analysis |
| Voting | Agents vote, majority wins | Consensus |
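The Voting pattern's commit step is simple enough to show directly: count each agent's answer and commit the plurality winner. A minimal sketch (`MajorityVote` is an illustrative helper; ties go to whichever answer reaches the winning count first):

```go
package main

import "fmt"

// MajorityVote tallies agent answers and returns the most common one along
// with its vote count.
func MajorityVote(answers []string) (string, int) {
	counts := map[string]int{}
	best, bestN := "", 0
	for _, a := range answers {
		counts[a]++
		if counts[a] > bestN {
			best, bestN = a, counts[a]
		}
	}
	return best, bestN
}

func main() {
	winner, votes := MajorityVote([]string{"approve", "reject", "approve"})
	fmt.Println(winner, votes) // approve 2
}
```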
5.3 Inter-Agent Communication
```go
// Direct messaging (via Temporal signals)
ctx.Send("critic", Message{Type: "feedback", Content: feedback})

// Shared blackboard (external state)
ctx.Blackboard.Write("current_draft", draft)
draft := ctx.Blackboard.Read("current_draft")

// Event bus (pub/sub via Temporal)
ctx.Publish("draft_updated", DraftEvent{...})
ctx.Subscribe("draft_updated", handler)
```
5.4 Saga Pattern for Multi-Agent Transactions
```go
saga := NewSaga("order-processing").
    Step("validate", ValidateAgent, CompensateValidation).
    Step("reserve", InventoryAgent, CompensateReservation).
    Step("charge", PaymentAgent, CompensateCharge).
    Step("fulfill", FulfillmentAgent, CompensateFulfillment).
    Build()
```
- Automatic compensation on failure
- Distributed transaction semantics across agents
- Partial completion visibility
Exit Criteria
- Supervisor can coordinate 3+ agents
- Debate pattern produces reasoned output
- Saga compensates correctly on mid-process failure
- Blackboard state consistent across agents
Phase 6: Differentiating Features
Goal: Capabilities that don't exist in LangGraph or any current framework.
6.1 Causal Tracing
What: Track exactly which inputs influenced which outputs across the entire agent execution.
```go
type CausalTrace struct {
    OutputID   string
    Influences []Influence
}

type Influence struct {
    InputID    string   // which input
    InputType  string   // "user_message", "tool_result", "memory_fact"
    Pathway    []string // steps it passed through
    Confidence float64  // attribution strength
}

// Query: "Why did the agent recommend product X?"
trace := GetCausalTrace(runID, outputID)
// Returns: user mentioned "budget", tool returned price data, memory had preference
```
Enables:
- Explainability: "The agent recommended X because..."
- Debugging: "This output was wrong because this tool returned bad data"
- Compliance: Audit trail for regulated industries
6.2 Counterfactual Replay
What: "What would have happened if...?" simulation.
```go
// Original run made decision A at step 3
// What if it had made decision B?
counterfactual := ReplayWithOverride(runID, Override{
    StepIndex: 3,
    Decision:  DecisionB,
})

// Returns: full execution trace of alternate timeline
```
Enables:
- Decision analysis: compare outcomes of different choices
- Training data: generate (decision, outcome) pairs
- Root cause: "If the agent had done X instead, would it have succeeded?"
6.3 Adaptive Retry with Learning
What: Retry policies that learn from failure patterns.
```go
type AdaptiveRetryPolicy struct {
    BasePolicy      RetryPolicy
    LearningEnabled bool

    // Learned adjustments
    SuccessPatterns []Pattern // conditions that predict success
    FailurePatterns []Pattern // conditions that predict failure
}

// System learns: "HTTP calls to api.example.com fail 80% between 2-3pm UTC"
// Automatically: delays retries, switches to backup, or skips
```
6.4 Cost-Aware Execution
What: Track and optimize for cost, not just correctness.
```go
type CostPolicy struct {
    MaxCostPerRun  float64
    MaxCostPerStep float64
    PreferCheaper  bool // choose cheaper model when confidence high
    CostBreakdown  bool // detailed cost attribution
}

type StepCost struct {
    StepIndex int
    TokenCost float64
    ToolCost  float64 // API calls, compute, etc.
    TimeCost  time.Duration
    TotalCost float64
}

// Query: "Which step is most expensive?"
// Action: Auto-switch to cheaper model for low-stakes decisions
```
6.5 Semantic Versioning for Agent Behavior
What: Version agents like software, with semantic meaning.
```go
type AgentVersion struct {
    Version   string // semver
    Changelog string

    // Breaking change detection
    SchemaChanges   []SchemaChange   // tool interface changes
    BehaviorChanges []BehaviorChange // detected via test suite

    // Compatibility
    Compatible    []string          // versions this can replace
    MigrationPath func(State) State // state migration if needed
}

// Deploy with confidence: "v2.1.0 is backward-compatible with v2.0.x"
// Rollback safely: "v2.2.0 broke X, rolling back to v2.1.0"
```
6.6 Differential Testing
What: Automatically compare agent versions.
```go
// Run same inputs through two agent versions
diff := DifferentialTest(
    AgentV1, AgentV2,
    TestSuite{
        Inputs:  []Input{...},
        Metrics: []Metric{Correctness, Cost, Latency},
    },
)

// Output: "V2 is 15% more accurate, 20% cheaper, but 10% slower"
// Breakdown by input category, failure modes, edge cases
```
6.7 Live Agent Inspection
What: Attach to running agents and inspect/modify state.
```
// From CLI or UI
inspect --workflow-id agent-run-123
> state                       # print current state
> state.memory                # drill into memory
> set state.goal "new goal"   # modify state (triggers re-evaluation)
> step                        # execute one step, pause
> continue                    # resume normal execution
> inject tool_result {...}    # inject synthetic tool result
```
Enables:
- Debugging stuck agents in production
- Manual intervention without restart
- Training/demo scenarios with controlled inputs
6.8 Speculative Execution
What: Run multiple decision paths in parallel, commit the best one.
```go
type SpeculativeConfig struct {
    Enabled     bool
    MaxBranches int           // how many paths to explore
    SelectionFn SelectionFunc // how to pick winner
    CostLimit   float64       // max extra cost for speculation
}

// Agent uncertain between 3 approaches
// System runs all 3 in parallel (as child workflows)
// Commits the one that succeeds first / scores highest / costs least
```
Enables:
- Faster resolution of uncertain decisions
- Automatic A/B testing of strategies
- Graceful handling of "I'm not sure, let me try both"
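The run-in-parallel-then-commit mechanic can be sketched without Temporal: explore every path concurrently, refuse to commit if the speculation budget was blown, otherwise take the best score. `Candidate` and `Speculate` are hypothetical names for this sketch; in the runtime the paths would be child workflows.

```go
package main

import (
	"fmt"
	"sync"
)

// Candidate is one speculative path's outcome.
type Candidate struct {
	Name  string
	Score float64
	Cost  float64
}

// Speculate runs every path in parallel. If the combined extra cost stays
// within the budget, it commits the highest-scoring candidate; otherwise
// it refuses to commit.
func Speculate(costLimit float64, paths ...func() Candidate) (Candidate, bool) {
	results := make([]Candidate, len(paths))
	var wg sync.WaitGroup
	for i, run := range paths {
		wg.Add(1)
		go func(i int, run func() Candidate) {
			defer wg.Done()
			results[i] = run()
		}(i, run)
	}
	wg.Wait()

	total := 0.0
	for _, c := range results {
		total += c.Cost
	}
	if total > costLimit {
		return Candidate{}, false // speculation too expensive to commit
	}
	best, found := Candidate{}, false
	for _, c := range results {
		if !found || c.Score > best.Score {
			best, found = c, true
		}
	}
	return best, found
}

func main() {
	best, ok := Speculate(1.0,
		func() Candidate { return Candidate{Name: "direct-answer", Score: 0.6, Cost: 0.1} },
		func() Candidate { return Candidate{Name: "deep-research", Score: 0.9, Cost: 0.4} },
	)
	fmt.Println(best.Name, ok) // deep-research true
}
```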
Phase 7A: Developer Experience — CLI + Scaffolding
Goal: Fast path to a working agent project and local iteration.
Deliverables
- CLI project scaffolding:
  - `temporal-agent init my-agent`
  - `temporal-agent dev` (local run + hot reload)
  - `temporal-agent deploy --cluster prod`
- Environment bootstrap (compose, env templates, registry seed)
- Local workflow inspection shortcuts (list runs, latest run, cost summary)
Exit Criteria
- New project created in < 2 minutes
- `dev` runs local worker + sample run end-to-end
Phase 7B: Developer Experience — Testing & Replay Harness
Goal: Deterministic tests and safe replay tooling.
Deliverables
- Deterministic LLM mocks + tool mocks
- Golden input tests for decision logs
- Replay helpers (from step N, override decision/state)
```go
func TestResearchAgent(t *testing.T) {
    suite := agenttest.NewSuite(t)

    // Deterministic LLM mock
    suite.MockLLM(agenttest.Deterministic(map[string]string{
        "hash-of-prompt-1": "expected-response-1",
    }))

    // Tool mocks
    suite.MockTool("http_get", func(args Args) Result {
        return Result{Body: "mock response"}
    })

    // Run and assert
    result := suite.Run(ResearchAgent, Input{Prompt: "test"})
    assert.Equal(t, "expected output", result.Output)
    assert.Equal(t, 3, result.StepCount)
    assert.NoError(t, result.Error)
}
```
Exit Criteria
- Tests run deterministically with mocks
- Replay from step N reproducible in CI
Phase 7C: Developer Experience — Web UI + VS Code Extension
Goal: First-class operator UI and editor integration.
Deliverables
Web UI
- Run list view (status, duration, cost)
- Run detail view (timeline, state diffs, tool calls)
- Interrupt queue (approvals/rejections)
- Replay console (compare original vs replay)
VS Code Extension
- DSL syntax highlighting
- Inline tool schema validation
- Run agent + output panel
- Breakpoints and step debugging
- Temporal workflow visualization
Exit Criteria
- UI shows all run details with drill-down
- Extension provides meaningful DX improvement
Phase 7D: Provider Parity — Azure OpenAI
Goal: Azure OpenAI support with the same model catalog + cost pipeline as OpenRouter.
Deliverables
- Azure OpenAI client (chat + tools) with config-based provider selection
- Model catalog ingestion (list models + context limits + pricing)
- Cost accounting with Azure usage metrics
- Per-run/provider selection (OpenRouter or Azure) and fallback rules
- Tooling parity: `models-refresh`, `models-list`, cost query
Exit Criteria
- Azure and OpenRouter interchangeable via config
- Cost tracking matches provider usage
Phase 7E: Sandboxed Python Tool
Goal: Provide a safe, sandboxed Python execution tool that agents can call as part of workflows.
Deliverables
- `python_exec` tool with strict resource limits and no network access
- Output capture (stdout/stderr/exit code)
- Tool schema with timeouts and input payloads
- Tests covering timeouts and sandbox constraints
Exit Criteria
- Tool executes simple scripts safely
- Timeouts enforced
- Network and filesystem write access blocked
Phase 8: Integrations & Ecosystem
Goal: Batteries included for common use cases.
8.1 LLM Providers
| Provider | Status | Features |
|---|---|---|
| OpenAI | Priority | GPT-4, function calling, vision |
| Anthropic | Priority | Claude, tool use |
| Google | Phase 2 | Gemini |
| AWS Bedrock | Phase 2 | Multi-model |
| Local (Ollama) | Phase 2 | Self-hosted |
| Azure OpenAI | Phase 7D | Enterprise |
8.2 Tool Kits
| Kit | Tools |
|---|---|
| Web | HTTP GET/POST, scrape, screenshot |
| Search | Google, Bing, DuckDuckGo, Tavily |
| Database | SQL query, vector search |
| Files | Read, write, parse (PDF, DOCX, etc.) |
| Code | Execute Python/JS, sandbox |
| Communication | Email, Slack, SMS |
| Cloud | AWS, GCP, Azure SDKs |
8.3 Vector Stores
- Pinecone
- Weaviate
- Qdrant
- pgvector
- Chroma (local)
8.4 Observability
- OpenTelemetry (native)
- Datadog
- Honeycomb
- Grafana stack
8.5 Auth & Secrets
- Vault integration
- AWS Secrets Manager
- Environment injection
- OAuth flows for tools
Phase 9: Enterprise Hardening
Goal: Production-ready for serious workloads.
Deliverables
9.1 Multi-Tenancy
```go
type TenantConfig struct {
    TenantID       string
    Namespace      string // Temporal namespace isolation
    ResourceQuotas ResourceQuotas
    RateLimits     RateLimits
    AllowedModels  []string
    AllowedTools   []string
}
```
9.2 Rate Limiting & Quotas
- Per-tenant rate limits
- Per-model token quotas
- Per-tool call limits
- Graceful degradation on quota exhaustion
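Per-tenant rate limiting is typically a token bucket, and the "without race conditions" exit criterion below means the limiter must be safe under concurrent workers. A minimal stdlib sketch (`TokenBucket` is an illustrative name; a production deployment would likely use a shared store or `golang.org/x/time/rate`):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a minimal per-tenant rate limiter: capacity tokens,
// refilled at rate tokens/second. Allow is safe for concurrent use.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64
	tokens   float64
	rate     float64 // tokens per second
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, rate: rate, last: time.Now()}
}

// Allow consumes one token if available, refilling based on elapsed time.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := NewTokenBucket(2, 0) // 2 calls allowed, no refill
	fmt.Println(bucket.Allow(), bucket.Allow(), bucket.Allow()) // true true false
}
```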
9.3 Audit Logging
```go
type AuditEvent struct {
    Timestamp time.Time
    TenantID  string
    RunID     string
    EventType string // "decision", "tool_call", "state_change", "approval"
    Actor     string // "agent", "user:123", "system"
    Details   any
    Hash      string // tamper-evident
}
```
- Immutable audit log
- Compliance exports (SOC2, HIPAA formats)
- PII detection and redaction
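One common way to make the `Hash` field tamper-evident is a hash chain: each entry's hash covers its payload plus the previous hash, so rewriting any past entry invalidates every later one. A sketch under that assumption (`AuditEntry`, `Append`, `Verify` are illustrative names):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AuditEntry is a hash-chained log record.
type AuditEntry struct {
	Payload string
	Hash    string
}

// chainHash covers the previous hash and this entry's payload.
func chainHash(prev, payload string) string {
	sum := sha256.Sum256([]byte(prev + "|" + payload))
	return hex.EncodeToString(sum[:])
}

// Append adds an event to the chain.
func Append(entries []AuditEntry, payload string) []AuditEntry {
	prev := ""
	if len(entries) > 0 {
		prev = entries[len(entries)-1].Hash
	}
	return append(entries, AuditEntry{Payload: payload, Hash: chainHash(prev, payload)})
}

// Verify recomputes the chain and returns the index of the first tampered
// entry, or -1 if the chain is intact.
func Verify(entries []AuditEntry) int {
	prev := ""
	for i, e := range entries {
		if chainHash(prev, e.Payload) != e.Hash {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	entries := Append(Append(nil, `{"event":"decision"}`), `{"event":"tool_call"}`)
	fmt.Println(Verify(entries)) // -1: chain intact
	entries[0].Payload = "tampered"
	fmt.Println(Verify(entries)) // 0: first bad entry
}
```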
9.4 Security
- Tool sandboxing (gVisor, Firecracker)
- Input sanitization
- Output filtering
- Network policies for tools
9.5 High Availability
- Multi-region Temporal deployment guide
- State replication strategies
- Disaster recovery playbook
Exit Criteria
- Tenants isolated at namespace level
- Rate limits enforced without race conditions
- Audit log tamper-evident
- Security review passed
Timeline (Estimated)
| Phase | Duration | Cumulative |
|---|---|---|
| Phase 1: MVP | 4-6 weeks | 6 weeks |
| Phase 2: Checkpoints/HITL | 4-6 weeks | 12 weeks |
| Phase 3: Graphs | 3-4 weeks | 16 weeks |
| Phase 4: Memory | 4-6 weeks | 22 weeks |
| Phase 5: Multi-Agent | 4-6 weeks | 28 weeks |
| Phase 6: Differentiators | 6-8 weeks | 36 weeks |
| Phase 7A: DevEx CLI | 2-3 weeks | 38-39 weeks |
| Phase 7B: DevEx Testing | 2-3 weeks | 40-42 weeks |
| Phase 7C: DevEx UI/Extension | 4-6 weeks | 44-48 weeks |
| Phase 7D: Azure Provider Parity | 2-3 weeks | 46-51 weeks |
| Phase 8: Integrations | Ongoing | - |
| Phase 9: Enterprise | 6-8 weeks | 52-59 weeks |
Note: Phases 7-8 can run in parallel with 5-6. Enterprise hardening (Phase 9) can begin after Phase 2 if there's early enterprise interest.
Success Metrics
Adoption
- GitHub stars (a vanity metric, but it drives visibility)
- Production deployments (the metric that matters)
- Community contributions
Technical
- Agent survival rate (% completing without crash)
- Mean time to recovery (on failure)
- P99 step latency
Developer Experience
- Time to first working agent
- Test coverage achievable
- Debug time for failures
Differentiation
- Features not available elsewhere
- Performance vs alternatives
- Operational clarity vs alternatives
Open Questions
- DSL syntax: Go-native (method chaining) vs external config (YAML/JSON)?
- State serialization: Protobuf vs JSON vs custom?
- Memory store default: Postgres (safe) vs Redis (fast)?
- UI: Build custom vs extend Temporal UI?
- Licensing: Apache 2.0 vs MIT vs BSL?
Appendix: Why Not Just Use LangGraph?
LangGraph is excellent for prototyping. It falls short for production:
| Concern | LangGraph | This Runtime |
|---|---|---|
| Durability | Library checkpointers (Postgres, Redis) | Temporal cluster (distributed, battle-tested) |
| Failure handling | Manual retry logic | Infrastructure-level retries, timeouts, circuit breakers |
| Multi-agent | Graph composition | First-class supervision, sagas, compensation |
| Debugging | Thread history | Full replay, counterfactual analysis, causal tracing |
| Operational control | Basic | Signals, queries, live inspection, hot fixes |
| Scale | Single process | Distributed workers, horizontal scale |
| Long-running | Memory issues | Continue-as-new, bounded histories |
We're not building "LangGraph but different." We're building a production-grade agent runtime that happens to match (and exceed) LangGraph's capabilities.
Last updated: Phase 0 (Planning)