Commit d73f2ad

swarm:codex-blog-v2-3
johnlindquist authored and committed

docs(blog): replace stateless retries article text

Overwrite the target blog file with the exact markdown content provided for this task.

This preserves all headings, prose, and fenced code examples verbatim.

Verified: `cat <<'EOF' | diff -u - .blog/stateless-retries-multiply-agent-side-effects.md` (no diff; exit 0)
Verified: repository Biome config does not include Markdown files (`.md`), so no project-scoped markdown linter applies.

1 parent 8d48f33

File tree

1 file changed

+101
-50
lines changed

1 file changed

+101
-50
lines changed
Lines changed: 101 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -1,83 +1,134 @@
# Workflow DevKit keeps Agents' tool-call volume linear under retries

## Headline finding

Stateless retries turn an Agent's tool calls into repeated work. As the number of tool calls per run grows, the expected number of executed calls grows faster than linearly, because a single failure forces a full replay of the prefix.

Workflow DevKit changes the unit of retry. The workflow replays deterministically, but completed steps return recorded results. A transient failure retries one step, not the entire Agent turn.

## Methodology

Model an Agent run as `N` sequential tool calls. Each call fails transiently with probability `p` and succeeds with probability `q = 1 - p`.

Compare two retry strategies:

* **Stateless retry:** a failure restarts the whole run from tool call 1.
* **Durable step retry:** a failure retries only the failed call; prior successful calls do not re-execute.

This isolates the retry surface area. It does not assume anything about the LLM or tools beyond an independent per-call failure rate.
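Under these assumptions the two strategies are easy to simulate. The sketch below is illustrative, not part of the post's tooling (`statelessRun` and `durableRun` are hypothetical helpers, not Workflow DevKit APIs); `rand` is injected so failure patterns can be scripted.

```typescript
// Count executed tool calls until one run of N calls fully succeeds.
// Each executed call draws one value from `rand`; a draw below p is a
// transient failure.

function statelessRun(N: number, p: number, rand: () => number): number {
  let executed = 0;
  for (;;) {
    let ok = true;
    for (let i = 0; i < N; i += 1) {
      executed += 1;
      if (rand() < p) {
        ok = false; // transient failure: the whole prefix is thrown away
        break;
      }
    }
    if (ok) return executed; // needed N consecutive successes
  }
}

function durableRun(N: number, p: number, rand: () => number): number {
  let executed = 0;
  for (let i = 0; i < N; i += 1) {
    do {
      executed += 1; // retry only this call until it succeeds
    } while (rand() < p);
  }
  return executed;
}
```

Averaged over many trials with `rand = Math.random`, the two counts converge to the closed forms derived in the next section.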

## Data

With stateless retry, the run completes only after it achieves `N` consecutive successful calls. The expected number of executed calls is:

`E_stateless = (1 - q^N) / (p * q^N)`

With durable step retry, each call is a geometric retry until success, so:

`E_durable = N / q`

Concrete numbers:

* `p = 0.02`, `N = 40`: stateless `62.2` calls vs durable `40.8` calls (1.52x).
* `p = 0.05`, `N = 20`: stateless `35.8` calls vs durable `21.1` calls (1.70x).
* `p = 0.10`, `N = 40`: stateless `666.5` calls vs durable `44.4` calls (15.0x).

The ratio compounds because stateless retry forces the run to finish the entire chain without a single transient failure. Durable steps turn that into independent retries per call.
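Both formulas are one-liners, so the bullet values above are easy to reproduce. This sketch is illustrative (the function names are not from the post):

```typescript
// Expected executed calls under each strategy, from the closed forms above.

function expectedStateless(N: number, p: number): number {
  const q = 1 - p;
  // Expected calls until N consecutive successes: (1 - q^N) / (p * q^N)
  return (1 - q ** N) / (p * q ** N);
}

function expectedDurable(N: number, p: number): number {
  // Each call retries independently until success: N / q
  return N / (1 - p);
}

for (const [p, N] of [[0.02, 40], [0.05, 20], [0.1, 40]]) {
  const es = expectedStateless(N, p);
  const ed = expectedDurable(N, p);
  console.log(`p=${p}, N=${N}: ${es.toFixed(1)} vs ${ed.toFixed(1)} (${(es / ed).toFixed(2)}x)`);
}
```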

## Core insight

In agent workloads, the expensive part is not the control flow. It is the tool boundary: API calls, database writes, emails, payments, rate-limited endpoints. Stateless retry replays those boundaries unless the application builds its own ledger of what already executed.

That ledger is the same thing a durable runtime provides: an event log keyed by stable correlation ids. Workflow DevKit already emits a correlation id per step and records its lifecycle (`created`, `started`, `retrying`, `completed`, `failed`). Replay rehydrates the workflow and returns step results without re-executing successful calls.
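As a minimal sketch of that mechanism (hypothetical types and helpers, not the Workflow DevKit implementation), a step keyed by a stable correlation id consults the log before executing:

```typescript
// Illustrative event log: only the `completed` lifecycle event is modeled
// here; the real runtime also records created/started/retrying/failed.
type CompletedEvent = { stepId: string; status: "completed"; result: unknown };

class EventLog {
  private completedSteps = new Map<string, CompletedEvent>();

  recordCompleted(stepId: string, result: unknown): void {
    this.completedSteps.set(stepId, { stepId, status: "completed", result });
  }

  lookup(stepId: string): CompletedEvent | undefined {
    return this.completedSteps.get(stepId);
  }
}

// On replay, a completed step resolves from its recorded result; only
// steps without a recorded result execute (and then record) their effect.
async function runStep<T>(log: EventLog, stepId: string, fn: () => Promise<T>): Promise<T> {
  const prior = log.lookup(stepId);
  if (prior) return prior.result as T; // rehydrate, no side effect
  const result = await fn();
  log.recordCompleted(stepId, result);
  return result;
}
```

Replaying the same workflow against the same log runs each step's `fn` at most once per success, which is exactly why restarts stop multiplying tool calls.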

## Practical takeaway

Use durable steps for every side-effecting tool call. Keep the workflow function deterministic and let the runtime handle replay and selective retry. If a tool supports idempotency keys, derive the key from the step correlation id instead of inventing your own scheme.

### Stateless retry duplicates work

**Before: retrying an Agent turn replays the full prefix**

```ts
export async function agentTurn(input: Input) {
  for (let attempt = 1; attempt <= 5; attempt += 1) {
    try {
      const a = await toolA(input);
      const b = await toolB(a);
      const c = await toolC(b);
      return { a, b, c };
    } catch (err) {
      if (attempt === 5) throw err;
      await sleepMs(1000 * attempt);
    }
  }
  throw new Error("unreachable");
}
```

**After: durable steps replay successful calls and retry only the failed one**

```ts
import { RetryableError } from "workflow";

async function toolA(input: Input) { 'use step'; return callA(input); }
async function toolB(a: A) { 'use step'; return callB(a); }
async function toolC(b: B) {
  'use step';
  const res = await callC(b);
  if (res.transient === true) throw new RetryableError("toolC transient", { retryAfter: "2s" });
  return res;
}

export async function agentTurn(input: Input) {
  'use workflow';
  const a = await toolA(input);
  const b = await toolB(a);
  return await toolC(b);
}
```

### Stop managing idempotency keys by hand

**Before: generating and persisting idempotency keys across retries**

```ts
import { sql } from "./db";
import { randomUUID } from "crypto";

export async function purchase(runId: string, userId: string) {
  const [row] = await sql`SELECT charge_key, email_key FROM runs WHERE id=${runId}`;
  const chargeKey = row?.charge_key ?? randomUUID();
  const emailKey = row?.email_key ?? randomUUID();
  await sql`UPDATE runs SET charge_key=${chargeKey}, email_key=${emailKey} WHERE id=${runId}`;
  await stripe.charges.create({ amount: 499, customer: userId }, { idempotencyKey: chargeKey });
  await sendReceiptEmail(userId, { idempotencyKey: emailKey });
}
```

**After: use the step correlation id as the idempotency key**

```ts
import { getStepMetadata } from "workflow";

async function chargeCard(userId: string, amount: number) {
  'use step';
  const { stepId } = getStepMetadata();
  return stripe.charges.create({ amount, customer: userId }, { idempotencyKey: stepId });
}

async function sendReceipt(userId: string) {
  'use step';
  const { stepId } = getStepMetadata();
  await mailer.sendReceipt({ userId }, { idempotencyKey: stepId });
}

export async function purchase(userId: string) {
  'use workflow';
  await chargeCard(userId, 499);
  await sendReceipt(userId);
}
```

Inspect recorded runs with the CLI:

```bash
npx -y -p @workflow/cli wf inspect runs
```
