Durable Execution
Durable execution is a runtime guarantee that a long-running process completes correctly even after crashes, timeouts, or infrastructure restarts - by persisting state at every step boundary and resuming from the last checkpoint.
KEY TAKEAWAYS
- Durable execution means a failed run resumes from the step that failed - not from the beginning of the entire run.
- Without durable execution, any infrastructure failure mid-workflow loses all progress made before the crash.
- Durable execution decouples logical run time from compute time - a workflow can span days while using compute only during active steps.
- Checkpointing state after every step is the mechanism that makes execution durable - not retries alone.
- Calljmp provides durable execution as a core runtime primitive - every workflow step is checkpointed automatically, with no additional code required.
WHAT IS DURABLE EXECUTION?
Durable execution is a property of a runtime that guarantees a long-running process will complete its intended work despite infrastructure failures. The runtime achieves this by serializing and persisting execution state to durable storage after each step completes. If the process crashes, the runtime restores state from the last checkpoint and continues from that point - skipping steps that already succeeded.
Durable execution is not the same as retry logic. A retry restarts a failed operation from the beginning. Durable execution resumes a failed process from the exact step that failed - with all prior state intact. A 20-step workflow that fails at step 18 resumes at step 18, not step 1. This distinction matters most for long-running agent workflows where early steps involve expensive model calls, file processing, or irreversible side effects.
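To put numbers on the 20-step example, the sketch below counts how many step executions each strategy needs when the first attempt crashes at step 18. It is a toy simulation, not a runtime: `attempt` and `total_executions` are hypothetical names, and every step is assumed to cost the same.

```python
from typing import Optional, Tuple

TOTAL_STEPS = 20


def attempt(start_step: int, fail_at: Optional[int]) -> Tuple[int, bool]:
    """Run steps start_step..TOTAL_STEPS and return (steps_executed, finished)."""
    executed = 0
    for step in range(start_step, TOTAL_STEPS + 1):
        executed += 1
        if step == fail_at:
            return executed, False  # the crash happens during this step
    return executed, True


def total_executions(resume_from_checkpoint: bool, fail_at: int = 18) -> int:
    """Count step executions across a crashing attempt and a successful one."""
    first, _ = attempt(1, fail_at)
    # The only difference between the two strategies is where attempt two starts.
    restart_step = fail_at if resume_from_checkpoint else 1
    second, finished = attempt(restart_step, None)
    assert finished
    return first + second


restart_cost = total_executions(resume_from_checkpoint=False)  # 18 + 20
resume_cost = total_executions(resume_from_checkpoint=True)    # 18 + 3
print(restart_cost, resume_cost)  # prints: 38 21
```

Under these assumptions the restart strategy executes 38 steps and the resume strategy 21. The gap widens as the failure point moves later in the workflow or as early steps become more expensive.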
HOW DURABLE EXECUTION WORKS
- Initialize a run context. The runtime creates a unique execution record - a run ID, input payload, and initial state - stored in durable storage before execution begins.
- Execute a step. The runtime runs the next unit of work - a model call, a tool invocation, a computation - inside an isolated execution scope.
- Checkpoint on success. After the step completes successfully, the runtime writes the step output and updated state to durable storage before proceeding to the next step.
- Detect failure. If a step throws, times out, or the process crashes, the runtime detects the incomplete state on the next execution attempt.
- Resume from checkpoint. The runtime restores the last persisted state and re-enters execution at the failed step - replaying inputs but not re-running completed steps.
- Complete and close. Once all steps succeed, the runtime marks the run as complete, stores the final trace, and releases the execution context.
The critical infrastructure requirement: the durable storage layer must be consistent and fast enough to checkpoint state without adding prohibitive latency to each step. If checkpointing is too slow, teams skip it - which defeats the guarantee entirely.
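The steps above can be sketched as a small run loop. This is a minimal illustration under stated assumptions, not Calljmp's implementation: `FileCheckpointStore`, `run`, and the record fields (`state`, `next_step`, `status`) are hypothetical names, and a local JSON file stands in for durable storage.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Step = Tuple[str, Callable[[State], State]]


class FileCheckpointStore:
    """Hypothetical store: one JSON file per run, replaced atomically."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def _path(self, run_id: str) -> str:
        return os.path.join(self.directory, f"{run_id}.json")

    def load(self, run_id: str) -> Optional[Dict[str, Any]]:
        try:
            with open(self._path(run_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

    def save(self, run_id: str, record: Dict[str, Any]) -> None:
        tmp = self._path(run_id) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the rename
        os.replace(tmp, self._path(run_id))  # atomic swap: old or new, never half


def run(store: FileCheckpointStore, run_id: str, steps: List[Step], payload: State) -> State:
    record = store.load(run_id)
    if record is None:  # initialize the run context before executing anything
        record = {"input": payload, "state": dict(payload), "next_step": 0, "status": "running"}
        store.save(run_id, record)
    while record["next_step"] < len(steps):
        _name, fn = steps[record["next_step"]]
        new_state = fn(dict(record["state"]))  # may raise; nothing is saved if it does
        record["state"] = new_state
        record["next_step"] += 1
        store.save(run_id, record)  # checkpoint on success, before the next step
    record["status"] = "complete"
    store.save(run_id, record)
    return record["state"]


calls: List[str] = []
flaky = {"fail_once": True}


def fetch(state: State) -> State:
    calls.append("fetch")
    return {**state, "doc": "contract text"}


def summarize(state: State) -> State:
    calls.append("summarize")
    if flaky["fail_once"]:
        flaky["fail_once"] = False
        raise RuntimeError("simulated crash")
    return {**state, "summary": state["doc"].upper()}


steps: List[Step] = [("fetch", fetch), ("summarize", summarize)]
store = FileCheckpointStore(tempfile.mkdtemp())
try:
    run(store, "run-1", steps, {"doc_id": 7})
except RuntimeError:
    pass  # the first attempt dies inside "summarize"
result = run(store, "run-1", steps, {"doc_id": 7})  # resumes at "summarize"
print(calls)  # prints: ['fetch', 'summarize', 'summarize']
```

The second call to `run` never re-executes `fetch`, because its output was already written to the store. The write-then-rename pattern in `save` is one way to keep a crash during checkpointing from leaving a half-written record. A production storage layer would need stronger guarantees than a local file.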
COMPARISON TABLE
| Dimension | No persistence | Retry logic only | Durable execution |
|---|---|---|---|
| Failure recovery | Full restart from step 1 | Full restart from step 1 | Resume from last completed step |
| State on crash | Lost entirely | Lost entirely | Persisted to durable storage |
| Completed steps re-run | Yes - always | Yes - always | No - skipped on resume |
| Supports multi-day runs | No | No | Yes - compute released between steps |
| Best for | Short, stateless tasks | Idempotent single operations | Long-running, multi-step agent workflows |
| Main trade-off | Fragile for anything non-trivial | Wastes completed work on failure | Requires durable storage infrastructure |
WHAT THIS MEANS FOR YOUR BUSINESS
Every time an AI agent crashes halfway through a task and has to start over, someone pays - in compute costs re-running completed steps, in time waiting for a re-run to finish, and in risk if the failed step had already sent an email, charged a card, or updated a record.
- Partial failures stop being disasters. A crash at step 15 of a 20-step workflow is a minor delay, not a full restart. For agents processing high-value work - contract reviews, financial reconciliations, customer onboarding - this is the difference between a recoverable incident and a costly one.
- Long-running agents become viable products. Workflows that require hours or days - multi-stage approvals, overnight batch processing, async research tasks - are only practical if the runtime can hold state across that time. Without durable execution, these workflows are too fragile to ship.
- Infrastructure failures stop appearing in user-facing errors. A server restart, a cold start, a timeout - none of these reach the end user when the runtime guarantees completion. The agent finishes. The user gets a result.
FAQ
What is the difference between durable execution and a retry mechanism?
A retry mechanism restarts a failed operation from its beginning - it has no memory of what succeeded before the failure. Durable execution persists the output of every completed step, so a failure mid-workflow resumes at the failed step with all prior state intact. For a 10-step workflow where step 9 fails, a retry re-runs all 10 steps; durable execution re-runs only step 9. The cost difference compounds significantly in workflows with expensive model calls or irreversible side effects in early steps.
Does durable execution prevent side effects from running twice?
Durable execution prevents completed steps from re-running - so a side effect in a successfully completed step will not repeat on resume. However, if a step fails after producing a side effect but before the checkpoint is written, that side effect may execute again on retry. Production systems handle this by designing steps to be idempotent where possible - meaning re-running the step produces the same result without duplicating the effect. Durable execution reduces the blast radius of this problem but does not eliminate it entirely.
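The gap between the side effect and the checkpoint can be illustrated with an idempotency key. The sketch below is an assumption-laden toy: `PaymentProvider` is a hypothetical stand-in for an external service that deduplicates on a caller-supplied key, and the key format (`run_id:step_name`) is one possible convention, not one prescribed by any particular runtime.

```python
from typing import Dict, Set


class PaymentProvider:
    """Hypothetical external service that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the original charge instead of creating a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
        return idempotency_key


provider = PaymentProvider()
checkpointed: Set[str] = set()  # stand-in for the runtime's record of completed steps


def charge_step(run_id: str, crash_before_checkpoint: bool) -> None:
    step_name = "charge-card"
    if step_name in checkpointed:
        return  # completed steps are skipped on resume
    # The key is derived from the run ID and step name, so it is stable across retries.
    provider.charge(f"{run_id}:{step_name}", 4_999)
    if crash_before_checkpoint:
        raise RuntimeError("crashed after the side effect, before the checkpoint")
    checkpointed.add(step_name)


try:
    charge_step("run-42", crash_before_checkpoint=True)
except RuntimeError:
    pass
charge_step("run-42", crash_before_checkpoint=False)  # resume re-runs the step
charge_step("run-42", crash_before_checkpoint=False)  # now skipped entirely
print(len(provider.charges))  # prints: 1
```

The step body runs twice, but the card is charged once because both attempts present the same key. This only works when the downstream service honors such keys. Where it does not, the step itself has to check whether the effect already happened.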
How does durable execution handle workflows that pause for days?
The runtime persists the full execution state to durable storage and releases compute when a pause is encountered - for a human approval, an external webhook, or a scheduled delay. No process stays alive during the wait. When the resume signal arrives, the runtime restores state from storage and continues execution in a fresh process. The workflow's logical run time can span days; the actual compute consumption is only the seconds or minutes of active processing.
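A pause of this kind can be sketched as two independent functions that share nothing except storage. The names `start_run` and `on_signal`, the `waiting_for_approval` status, and the in-memory `storage` dictionary are all hypothetical. The dictionary stands in for durable storage, and its values are JSON strings so that nothing survives between calls except serialized state.

```python
import json
from typing import Any, Dict

storage: Dict[str, str] = {}  # stand-in for durable storage; values are JSON strings


def start_run(run_id: str, request: Dict[str, Any]) -> None:
    """Do the work up to the pause, persist state, then return."""
    state = {"request": request, "status": "waiting_for_approval"}
    storage[run_id] = json.dumps(state)
    # Returning here is the point: no process, thread, or timer is kept alive.


def on_signal(run_id: str, approved: bool) -> Dict[str, Any]:
    """Handle the resume signal, possibly days later and in a different process."""
    state = json.loads(storage[run_id])  # rebuilt purely from storage
    if state["status"] != "waiting_for_approval":
        return state  # duplicate signal: the run has already moved on
    state["status"] = "approved" if approved else "rejected"
    storage[run_id] = json.dumps(state)
    return state


start_run("run-7", {"amount": 1200})
final = on_signal("run-7", approved=True)
print(final["status"])  # prints: approved
```

Because `on_signal` reads everything it needs from storage, it does not matter how much time passes between the two calls or which machine handles each one.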
Is durable execution only relevant for long-running workflows?
It is most critical for long-running workflows, but it provides value for any multi-step workflow where re-running completed steps has a cost. A 30-second workflow with 8 steps that calls three external APIs still benefits from durable execution - a failure at step 7 avoids re-running the 6 steps that already succeeded, including any API calls they made. The value scales with step count, step cost, and the presence of side effects - not just with total run duration. Calljmp applies durable execution to all workflow runs by default, regardless of expected duration.