Durable Execution

Durable execution is a runtime guarantee that a long-running process completes correctly even after crashes, timeouts, or infrastructure restarts - by persisting state at every step boundary and resuming from the last checkpoint.

KEY TAKEAWAYS

  • Durable execution means a failed step retries from where it stopped - not from the beginning of the entire run.
  • Without durable execution, any infrastructure failure mid-workflow loses all progress made before the crash.
  • Durable execution decouples logical run time from compute time - a workflow can span days while using compute only during active steps.
  • Checkpointing state after every step is the mechanism that makes execution durable - not retries alone.
  • Calljmp provides durable execution as a core runtime primitive - every workflow step is checkpointed automatically, with no additional code required.

WHAT IS DURABLE EXECUTION?

Durable execution is a property of a runtime that guarantees a long-running process will complete its intended work despite infrastructure failures. The runtime achieves this by serializing and persisting execution state to durable storage after each step completes. If the process crashes, the runtime restores state from the last checkpoint and continues from that point - skipping steps that already succeeded.

Durable execution is not the same as retry logic. A retry restarts a failed operation from the beginning. Durable execution resumes a failed process from the exact step that failed - with all prior state intact. A 20-step workflow that fails at step 18 resumes at step 18, not step 1. This distinction matters most for long-running agent workflows whose early steps involve expensive model calls, file processing, or irreversible side effects: re-running those steps either costs real money or repeats the effect.
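To make the 20-step example concrete, here is a minimal simulation that counts how many step executions each strategy needs when step 18 fails once. The function names are hypothetical and the "workflow" is just a counter; it is a sketch of the arithmetic, not of any particular runtime.

```python
def executions_with_full_restart(num_steps, fail_at):
    """Count step executions when a failure restarts the run from step 1."""
    executions = 0
    failed_once = False
    while True:
        for step in range(1, num_steps + 1):
            executions += 1
            if step == fail_at and not failed_once:
                failed_once = True
                break  # simulated crash: all progress is lost
        else:
            return executions  # the loop finished without a crash


def executions_with_resume(num_steps, fail_at):
    """Count step executions when the run resumes at the failed step."""
    executions = 0
    completed = 0  # index of the last checkpointed step
    failed_once = False
    while completed < num_steps:
        step = completed + 1
        executions += 1
        if step == fail_at and not failed_once:
            failed_once = True
            continue  # simulated crash: no checkpoint written for this step
        completed = step  # checkpoint after success
    return executions


print(executions_with_full_restart(20, 18))  # 38
print(executions_with_resume(20, 18))        # 21
```

Under these assumptions, a full restart pays for 18 executions on the first attempt plus 20 on the second, while resuming pays for 20 steps plus the single failed attempt.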

HOW DURABLE EXECUTION WORKS

  1. Initialize a run context. The runtime creates a unique execution record - a run ID, input payload, and initial state - stored in durable storage before execution begins.
  2. Execute a step. The runtime runs the next unit of work - a model call, a tool invocation, a computation - inside an isolated execution scope.
  3. Checkpoint on success. After the step completes successfully, the runtime writes the step output and updated state to durable storage before proceeding to the next step.
  4. Detect failure. If a step throws, times out, or the process crashes, the runtime detects the incomplete state on the next execution attempt.
  5. Resume from checkpoint. The runtime restores the last persisted state and re-enters execution at the failed step - feeding it the recorded outputs of earlier steps instead of re-running them.
  6. Complete and close. Once all steps succeed, the runtime marks the run as complete, stores the final trace, and releases the execution context.
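The six steps above can be sketched as a toy runtime. This is an illustration under stated assumptions, not how Calljmp or any specific product implements it: the `DurableRunner` class, its table layout, and the use of SQLite as the durable store are all hypothetical choices made for the example, and state is assumed to be JSON-serializable.

```python
import json
import sqlite3
import uuid


class DurableRunner:
    """Toy runtime (hypothetical): checkpoints each step's output to SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS runs "
            "(run_id TEXT PRIMARY KEY, input TEXT, status TEXT)"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(run_id TEXT, step_index INTEGER, output TEXT, "
            "PRIMARY KEY (run_id, step_index))"
        )
        self.db.commit()

    def start(self, payload):
        # 1. Persist the run record before any step executes.
        run_id = str(uuid.uuid4())
        self.db.execute(
            "INSERT INTO runs VALUES (?, ?, 'running')",
            (run_id, json.dumps(payload)),
        )
        self.db.commit()
        return run_id

    def execute(self, run_id, steps):
        """Run or resume a run. Each step is a function: state -> new state."""
        row = self.db.execute(
            "SELECT input FROM runs WHERE run_id = ?", (run_id,)
        ).fetchone()
        state = json.loads(row[0])
        # 4 and 5. Detect earlier progress and restore the last checkpoint.
        done = self.db.execute(
            "SELECT output FROM checkpoints WHERE run_id = ? "
            "ORDER BY step_index",
            (run_id,),
        ).fetchall()
        if done:
            state = json.loads(done[-1][0])
        for index in range(len(done), len(steps)):
            # 2. Execute the step. An exception leaves no checkpoint behind.
            state = steps[index](state)
            # 3. Checkpoint the output before moving to the next step.
            self.db.execute(
                "INSERT INTO checkpoints VALUES (?, ?, ?)",
                (run_id, index, json.dumps(state)),
            )
            self.db.commit()
        # 6. Mark the run complete.
        self.db.execute(
            "UPDATE runs SET status = 'complete' WHERE run_id = ?", (run_id,)
        )
        self.db.commit()
        return state


# Usage: the second step crashes once, then the run is resumed.
calls = []
crash = {"armed": True}

def fetch(state):
    calls.append("fetch")
    return {**state, "doc": "contract text"}

def summarize(state):
    calls.append("summarize")
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("simulated crash")
    return {**state, "summary": state["doc"].upper()}

runner = DurableRunner()
run_id = runner.start({"doc_id": 7})
try:
    runner.execute(run_id, [fetch, summarize])
except RuntimeError:
    pass  # the process "died"; the checkpoint for fetch is already stored
result = runner.execute(run_id, [fetch, summarize])
print(calls)  # ['fetch', 'summarize', 'summarize']
```

Note that `fetch` runs once even though the run was executed twice: the second call restores its checkpointed output and starts at `summarize`.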

The critical infrastructure requirement: the durable storage layer must be consistent and fast enough to checkpoint state without adding prohibitive latency to each step. If checkpointing is too slow, teams skip it - which defeats the guarantee entirely.
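One common way to keep a checkpoint consistent is to make each write atomic, so a crash mid-write never leaves a half-written checkpoint for the next attempt to read. The sketch below assumes a local file as the durable store, which is a simplification; the helper names `write_checkpoint` and `read_checkpoint` are hypothetical.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Replace the checkpoint at `path` with `state` in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        # Rename over the old file; atomic on POSIX filesystems.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path):
    """Load the most recently completed checkpoint."""
    with open(path) as f:
        return json.load(f)


# Usage: a reader sees either the old checkpoint or the new one, never a mix.
workdir = tempfile.mkdtemp()
checkpoint_path = os.path.join(workdir, "run-123.json")
write_checkpoint(checkpoint_path, {"next_step": 3, "state": {"total": 10}})
write_checkpoint(checkpoint_path, {"next_step": 4, "state": {"total": 15}})
restored = read_checkpoint(checkpoint_path)
```

The `fsync` call is where the latency trade-off shows up: it is what makes the write durable, and it is also typically the slowest part of the write.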

COMPARISON TABLE

| Dimension               | No persistence                   | Retry logic only                | Durable execution                        |
|-------------------------|----------------------------------|---------------------------------|------------------------------------------|
| Failure recovery        | Full restart from step 1         | Full restart from step 1        | Resume from last completed step          |
| State on crash          | Lost entirely                    | Lost entirely                   | Persisted to durable storage             |
| Completed steps re-run  | Yes - always                     | Yes - always                    | No - skipped on resume                   |
| Supports multi-day runs | No                               | No                              | Yes - compute released between steps     |
| Best for                | Short, stateless tasks           | Idempotent single operations    | Long-running, multi-step agent workflows |
| Main trade-off          | Fragile for anything non-trivial | Wastes completed work on failure | Requires durable storage infrastructure  |

What This Means for Your Business

Every time an AI agent crashes halfway through a task and has to start over, someone pays - in compute costs re-running completed steps, in time waiting for a re-run to finish, and in risk if the failed step had already sent an email, charged a card, or updated a record.

  • Partial failures stop being disasters. A crash at step 15 of a 20-step workflow is a minor delay, not a full restart. For agents processing high-value work - contract reviews, financial reconciliations, customer onboarding - this is the difference between a recoverable incident and a costly one.
  • Long-running agents become viable products. Workflows that require hours or days - multi-stage approvals, overnight batch processing, async research tasks - are only practical if the runtime can hold state across that time. Without durable execution, these workflows are too fragile to ship.
  • Infrastructure failures stop appearing in user-facing errors. A server restart, a cold start, a timeout - none of these reach the end user when the runtime guarantees completion. The agent finishes. The user gets a result.

FAQ

What is the difference between durable execution and a retry mechanism?

A retry mechanism restarts a failed operation from its beginning - it has no memory of what succeeded before the failure. Durable execution persists the output of every completed step, so a failure mid-workflow resumes at the failed step with all prior state intact. For a 10-step workflow where step 9 fails, a retry re-runs all 10 steps; durable execution re-runs only step 9. The cost difference compounds significantly in workflows with expensive model calls or irreversible side effects in early steps.

Does durable execution prevent side effects from running twice?

Durable execution prevents completed steps from re-running - so a side effect in a successfully completed step will not repeat on resume. However, if a step fails after producing a side effect but before the checkpoint is written, that side effect may execute again on retry. Production systems handle this by designing steps to be idempotent where possible - meaning re-running the step produces the same result without duplicating the effect. Durable execution reduces the blast radius of this problem but does not eliminate it entirely.
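A common way to make a step idempotent is to send the external service a key that stays the same across retries of that step, so the service can recognize and ignore the duplicate. The sketch below uses a hypothetical in-memory `PaymentProvider` to stand in for such a service; the key format `run_id:step_index` is an assumption for illustration.

```python
class PaymentProvider:
    """Hypothetical external service that deduplicates by idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the original charge instead of a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        return self.charges[idempotency_key]


def charge_step(provider, run_id, step_index, amount_cents):
    """Workflow step whose side effect is safe to repeat."""
    key = f"{run_id}:{step_index}"  # identical on every retry of this step
    return provider.charge(key, amount_cents)


provider = PaymentProvider()
first = charge_step(provider, "run-123", 4, 5000)
# Simulated crash here: the charge happened, but no checkpoint was written,
# so the runtime runs the same step again on resume.
second = charge_step(provider, "run-123", 4, 5000)

print(len(provider.charges))  # 1
print(first == second)        # True
```

This only works when the downstream service supports deduplication; where it does not, the step itself has to check whether the effect already happened before repeating it.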

How does durable execution handle workflows that pause for days?

The runtime persists the full execution state to durable storage and releases compute when a pause is encountered - for a human approval, an external webhook, or a scheduled delay. No process stays alive during the wait. When the resume signal arrives, the runtime restores state from storage and continues execution in a fresh process. The workflow's logical run time can span days; the actual compute consumption is only the seconds or minutes of active processing.
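A minimal sketch of this pause-and-resume pattern, assuming a JSON file as the durable store and a hypothetical `WAIT_FOR_APPROVAL` marker in the step list. Each call to `advance` stands in for a separate, short-lived process; nothing runs between the two calls.

```python
import json
import os
import tempfile

WAIT_FOR_APPROVAL = "wait_for_approval"  # hypothetical pause marker


def draft_contract(state):
    return {**state, "draft": f"Contract for {state['customer']}"}


def send_contract(state):
    return {**state, "sent": state["approved"]}


STEPS = [draft_contract, WAIT_FOR_APPROVAL, send_contract]


def save(path, record):
    with open(path, "w") as f:
        json.dump(record, f)


def load(path):
    with open(path) as f:
        return json.load(f)


def advance(path, signal=None):
    """Run steps until the workflow pauses or finishes."""
    record = load(path)  # restore state; works from a fresh process
    while record["next_step"] < len(STEPS):
        step = STEPS[record["next_step"]]
        if step == WAIT_FOR_APPROVAL:
            if signal is None:
                record["status"] = "suspended"
                save(path, record)
                return record  # caller exits; no compute is held while waiting
            record["state"].update(signal)  # merge the resume signal
            signal = None
        else:
            record["state"] = step(record["state"])
        record["next_step"] += 1
        record["status"] = "running"
        save(path, record)  # checkpoint after every step
    record["status"] = "complete"
    save(path, record)
    return record


# Usage: the first call suspends at the approval; the second resumes later.
path = os.path.join(tempfile.mkdtemp(), "run.json")
save(path, {"next_step": 0, "status": "running", "state": {"customer": "Acme"}})
paused = advance(path)
finished = advance(path, {"approved": True})
print(paused["status"], finished["status"])  # suspended complete
```

The gap between the two `advance` calls could be seconds or days; the stored record is the only thing that has to survive it.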

Is durable execution only relevant for long-running workflows?

It is most critical for long-running workflows, but it provides value for any multi-step workflow where re-running completed steps has a cost. A 30-second workflow with 8 steps that calls three external APIs still benefits from durable execution - a failure at step 7 avoids re-running the 6 steps that already succeeded, including any API calls they made. The value scales with step count, step cost, and the presence of side effects - not just with total run duration. Calljmp applies durable execution to all workflow runs by default, regardless of expected duration.

More from the glossary

Continue learning with more definitions and concepts from the Calljmp glossary.

Agentic Backend

An agentic backend is the infrastructure layer that handles execution, state, memory, and observability for AI agents running in production.

Agentic Memory

Agentic memory is the mechanism by which an AI agent stores, retrieves, and updates information across steps and sessions beyond a single context window.

Agentic RAG

Agentic RAG is a retrieval pattern where an AI agent decides what to retrieve, when, and from where - dynamically, across multiple steps. Learn how it works in production.