|
1 | | -# Stateless retries multiply agent side effects |
| 1 | +# Workflow DevKit keeps Agents' tool-call volume linear under retries |
2 | 2 |
|
3 | 3 | ## Headline finding |
4 | 4 |
|
5 | | -Stateless architectures turn agent reliability problems into duplication problems. Any restart — timeout, crash, deploy, scale event — pushes you toward re-executing earlier tool calls unless you built a persistence layer that can prove what already happened. |
| 5 | +Stateless retries turn an Agent's tool calls into repeated work. As the number of tool calls per run grows, the expected number of executed calls grows faster than linearly because a single failure forces a full replay of the prefix. |
6 | 6 |
|
7 | | -Durable execution flips that. It records step inputs and outputs in an event log and deterministically replays the agent loop, so restarts rehydrate state instead of repeating side effects. |
| 7 | +Workflow DevKit changes the unit of retry. The workflow replays deterministically, but completed steps return recorded results. A transient failure retries one step, not the entire Agent turn. |
8 | 8 |
|
9 | 9 | ## Methodology |
10 | 10 |
|
11 | | -Model an agent run as a sequence of `n` tool calls. Each call succeeds with probability `s` and fails transiently with probability `p = 1 - s`. |
| 11 | +Model an Agent run as `N` sequential tool calls. Each call fails transiently with probability `p` and succeeds with probability `q = 1 - p`. |
12 | 12 |
|
13 | | -Compare two implementations: |
| 13 | +Compare two retry strategies: |
14 | 14 |
|
15 | | -- **Stateless restart:** on any failure (or timeout that looks like failure), rerun the whole function from the beginning. This matches the common "retry the request" approach when you lack per-call checkpointing. |
16 | | -- **Durable steps:** isolate each tool call in a step (`'use step'`). If a call fails transiently, retry that step. The workflow (`'use workflow'`) replays from the event log and skips completed steps. |
| 15 | +* **Stateless retry:** a failure restarts the whole run from tool call 1. |
| 16 | +* **Durable step retry:** a failure retries only the failed call; prior successful calls do not re-execute. |
17 | 17 |
|
18 | | -This is a simplified model. It ignores correlated failures and assumes you can retry until success. In the real system you cap retries (Workflow DevKit defaults to 3 retries unless you override `stepFn.maxRetries`) and you may delay retries with `RetryableError`. |
| 18 | +This isolates the retry surface area. It does not assume anything about the LLM or tools beyond an independent per-call failure rate. |
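The two strategies can be sketched as a small Monte Carlo simulation under exactly these assumptions (independent per-call transient failures, unbounded retries); `expectedCalls` is an illustrative helper, not part of any library:

```typescript
// Expected executed calls per completed run, estimated by simulation.
// Assumes independent per-call transient failures and unbounded retries.
function expectedCalls(n: number, p: number, durable: boolean, trials = 200_000): number {
  let executed = 0;
  for (let t = 0; t < trials; t++) {
    if (durable) {
      // Durable step retry: each call retries in place until it succeeds.
      for (let i = 0; i < n; i++) {
        do { executed++; } while (Math.random() < p);
      }
    } else {
      // Stateless retry: any failure restarts the run from call 1,
      // re-executing the whole successful prefix.
      let i = 0;
      while (i < n) {
        executed++;
        i = Math.random() < p ? 0 : i + 1;
      }
    }
  }
  return executed / trials;
}
```

At `N = 20`, `p = 0.05` the stateless estimate converges on roughly 35.8 executed calls versus roughly 21.1 for durable steps.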
19 | 19 |
|
20 | 20 | ## Data |
21 | 21 |
|
22 | | -The cost metric is "how many tool calls do we execute to finish one successful run," because tool calls dominate agent cost and risk (tokens, rate limits, side effects). |
| 22 | +With stateless retry, the run completes only after it achieves `N` consecutive successful calls. The expected number of executed calls is: |
23 | 23 |
|
24 | | -For the stateless restart model, the run succeeds only after `n` consecutive successes. The expected number of calls until that happens is: |
| 24 | +`E_stateless = (1 - q^N) / (p * q^N)` |
25 | 25 |
|
26 | | -`E[calls] = (1 - s^n) / (p * s^n)` |
| 26 | +With durable step retry, each call is a geometric retry until success, so: |
27 | 27 |
|
28 | | -For durable steps, each call retries independently. The expected number of call attempts is: |
| 28 | +`E_durable = N / q` |
29 | 29 |
|
30 | | -`E[calls] = n / s` |
| 30 | +Concrete numbers: |
31 | 31 |
|
32 | | -| Steps in run (`n`) | Transient failure rate (`p`) | Stateless restart: expected calls | Durable steps: expected calls | Restart overhead | Durable overhead | |
33 | | -| --: | --: | --: | --: | --: | --: | |
34 | | -| 10 | 1% | 10.57 | 10.10 | 1.06x | 1.01x | |
35 | | -| 20 | 1% | 22.26 | 20.20 | 1.11x | 1.01x | |
36 | | -| 40 | 1% | 49.48 | 40.40 | 1.24x | 1.01x | |
37 | | -| 10 | 5% | 13.40 | 10.53 | 1.34x | 1.05x | |
38 | | -| 20 | 5% | 35.79 | 21.05 | 1.79x | 1.05x | |
39 | | -| 40 | 5% | 135.63 | 42.11 | 3.39x | 1.05x | |
| 32 | +* `p = 0.02`, `N = 40`: stateless `62.2` calls vs durable `40.8` calls (1.52x). |
| 33 | +* `p = 0.05`, `N = 20`: stateless `35.8` calls vs durable `21.1` calls (1.70x). |
| 34 | +* `p = 0.10`, `N = 40`: stateless `666.5` calls vs durable `44.4` calls (15.0x). |
40 | 35 |
|
41 | | -The gap grows with run length. At a 5% transient failure rate across 40 calls, the stateless restart model executes ~3.4x the work, on average, to get one successful completion. |
42 | | - |
43 | | -That "extra work" is not free retries. It is duplicated tool calls. If any of those tool calls write to external systems, you also created duplicated side effects unless you designed every tool integration to be idempotent. |
| 36 | +The ratio compounds because stateless retry forces the run to finish the entire chain without a single transient failure. Durable steps turn that into independent retries per call. |
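The concrete numbers above fall directly out of the two closed forms; a quick check, with symbols as defined in this section:

```typescript
// E_stateless = (1 - q^N) / (p * q^N): expected calls until N consecutive successes.
const eStateless = (n: number, p: number): number => {
  const q = 1 - p;
  return (1 - q ** n) / (p * q ** n);
};

// E_durable = N / q: each call is an independent geometric retry.
const eDurable = (n: number, p: number): number => n / (1 - p);

console.log(eStateless(40, 0.1).toFixed(1)); // ≈ 666.5
console.log(eDurable(40, 0.1).toFixed(1));   // ≈ 44.4
```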
44 | 37 |
|
45 | 38 | ## Core insight |
46 | 39 |
|
47 | | -Agent workflows compound failure probability. A single run touches many systems, and each system has its own tail latency and transient errors. |
48 | | - |
49 | | -Stateless runtimes give you one recovery primitive: re-execute code. That works for pure functions. Agents are not pure. They read and write external state, and they do it many times per run. |
50 | | - |
51 | | -Workflow DevKit uses a different recovery primitive: replay. The workflow function runs in a deterministic sandbox. Steps isolate side effects and persist their results to an append-only event log. On restart, the workflow replays and step calls resolve from recorded outputs. A transient failure retries only the failing step. |
| 40 | +In agent workloads, the expensive part is not the control flow. It is the tool boundary: API calls, database writes, emails, payments, rate-limited endpoints. Stateless retry replays those boundaries unless the application builds its own ledger of what already executed. |
52 | 41 |
|
53 | | -This is the practical difference between "retry is correct" and "retry is dangerous." |
| 42 | +That ledger is the same thing a durable runtime provides: an event log keyed by stable correlation ids. Workflow DevKit already emits a correlation id per step and records its lifecycle (`created`, `started`, `retrying`, `completed`, `failed`). Replay rehydrates the workflow and returns step results without re-executing successful calls. |
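A minimal sketch of that replay mechanic; the in-memory map and the `runStep` helper are illustrative, not the DevKit's actual internals:

```typescript
// Illustrative event log: completed step results keyed by correlation id.
const completed = new Map<string, unknown>();
let executions = 0; // counts real side-effecting executions

async function runStep<T>(correlationId: string, fn: () => Promise<T>): Promise<T> {
  // Replay path: a completed step resolves from its recorded output.
  if (completed.has(correlationId)) return completed.get(correlationId) as T;
  executions++;
  const result = await fn(); // first execution: run the side effect
  completed.set(correlationId, result);
  return result;
}

// Re-running this workflow replays it; each step executes exactly once.
async function workflow(): Promise<number> {
  const a = await runStep("step-a", async () => 1);
  const b = await runStep("step-b", async () => a + 1);
  return b;
}
```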
54 | 43 |
|
55 | 44 | ## Practical takeaway |
56 | 45 |
|
57 | | -If your agent touches external systems, treat every tool call as a durable step. Keep orchestration in the workflow and error policy in steps (`FatalError` for permanent failures, `RetryableError` for transient failures with backoff). |
| 46 | +Use durable steps for every side-effecting tool call. Keep the workflow function deterministic and let the runtime handle replay and selective retry. If a tool supports idempotency keys, derive the key from the step correlation id instead of inventing your own scheme. |
| 47 | + |
| 48 | +### Stateless retry duplicates work |
| 49 | + |
| 50 | +**Before: retrying an Agent turn replays the full prefix** |
| 51 | + |
| 52 | +```ts |
| 53 | +export async function agentTurn(input: Input) { |
| 54 | + for (let attempt = 1; attempt <= 5; attempt += 1) { |
| 55 | + try { |
| 56 | + const a = await toolA(input); |
| 57 | + const b = await toolB(a); |
| 58 | + const c = await toolC(b); |
| 59 | + return { a, b, c }; |
| 60 | + } catch (err) { |
| 61 | + if (attempt === 5) throw err; |
| 62 | + await sleepMs(1000 * attempt); |
| 63 | + } |
| 64 | + } |
| 65 | + throw new Error("unreachable"); |
| 66 | +} |
| 67 | +``` |
58 | 68 |
|
59 | | -```bash |
60 | | -npx workflow inspect run <run_id> |
| 69 | +**After: durable steps replay successful calls and retry only the failed one** |
| 70 | + |
| 71 | +```ts |
| 72 | +import { RetryableError } from "workflow"; |
| 73 | + |
| 74 | +async function toolA(input: Input) { 'use step'; return callA(input); } |
| 75 | +async function toolB(a: A) { 'use step'; return callB(a); } |
| 76 | +async function toolC(b: B) { |
| 77 | + 'use step'; |
| 78 | + const res = await callC(b); |
| 79 | + if (res.transient === true) throw new RetryableError("toolC transient", { retryAfter: "2s" }); |
| 80 | + return res; |
| 81 | +} |
| 82 | + |
| 83 | +export async function agentTurn(input: Input) { |
| 84 | + 'use workflow'; |
| 85 | + const a = await toolA(input); |
| 86 | + const b = await toolB(a); |
| 87 | + return await toolC(b); |
| 88 | +} |
61 | 89 | ``` |
62 | 90 |
|
63 | | ---- |
| 91 | +### Stop managing idempotency keys by hand |
64 | 92 |
|
65 | | -## Style justification |
| 93 | +**Before: generating and persisting idempotency keys across retries** |
66 | 94 |
|
67 | | -**What works extremely well:** |
68 | | -- Title follows the research post pattern perfectly — states the finding ("Stateless retries multiply agent side effects"), not the question. Compare to "AGENTS.md outperforms skills in our agent evals." |
69 | | -- The data table is the strongest element. It follows the Vercel research pattern: simplest possible table, one clear variable, ascending values that build to the headline result (3.39x overhead). The "AGENTS.md" post used the same structure. |
70 | | -- Mathematical formulas add genuine authority. This is not opinion — it is a derivable result. Vercel research posts lead with data, not argument. |
71 | | -- "This is a simplified model" matches the Vercel pattern of honest difficulty: acknowledge limitations as technical facts, not apologies. |
72 | | -- "retry is correct" vs. "retry is dangerous" — this closing line is the kind of quotable insight Vercel research posts end on. Compare to "Passive context beats active retrieval." |
73 | | -- Paragraphs are tight. The core insight section is 4 sentences, 3 paragraphs. Maximum density. |
| 95 | +```ts |
| 96 | +import { sql } from "./db"; |
| 97 | +import { randomUUID } from "crypto"; |
74 | 98 |
|
75 | | -**What could be stronger:** |
76 | | -- The methodology section describes a theoretical model, not empirical data. The "AGENTS.md" post ran actual evals with pass rates. A reader might ask "did you measure this on real agent runs?" Adding even one real-world data point (e.g., "across 1,000 production agent runs, we observed a 4.2% transient failure rate") would close that gap. |
77 | | -- The formulas assume geometric distributions and independence — reasonable but worth stating explicitly for a research-style post. The "simplified model" caveat partially covers this. |
78 | | -- No visual. Vercel research posts with tables often benefit from a chart showing the divergence curve. The 3.39x figure at 40 steps / 5% failure rate deserves a visual. |
| 99 | +export async function purchase(runId: string, userId: string) { |
| 100 | +  const [row] = await sql`SELECT charge_key, email_key FROM runs WHERE id=${runId}`;
| 101 | +  const chargeKey = row?.charge_key ?? randomUUID();
| 102 | +  const emailKey = row?.email_key ?? randomUUID();
| 103 | + await sql`UPDATE runs SET charge_key=${chargeKey}, email_key=${emailKey} WHERE id=${runId}`; |
| 104 | +  await stripe.charges.create({ amount: 499, currency: "usd", customer: userId }, { idempotencyKey: chargeKey });
| 105 | + await sendReceiptEmail(userId, { idempotencyKey: emailKey }); |
| 106 | +} |
| 107 | +``` |
| 108 | + |
| 109 | +**After: use the step correlation id as the idempotency key** |
| 110 | + |
| 111 | +```ts |
| 112 | +import { getStepMetadata } from "workflow"; |
| 113 | + |
| 114 | +async function chargeCard(userId: string, amount: number) { |
| 115 | + 'use step'; |
| 116 | + const { stepId } = getStepMetadata(); |
| 117 | +  return stripe.charges.create({ amount, currency: "usd", customer: userId }, { idempotencyKey: stepId });
| 118 | +} |
| 119 | +async function sendReceipt(userId: string) { |
| 120 | + 'use step'; |
| 121 | + const { stepId } = getStepMetadata(); |
| 122 | + await mailer.sendReceipt({ userId }, { idempotencyKey: stepId }); |
| 123 | +} |
| 124 | + |
| 125 | +export async function purchase(userId: string) { |
| 126 | + 'use workflow'; |
| 127 | + await chargeCard(userId, 499); |
| 128 | + await sendReceipt(userId); |
| 129 | +} |
| 130 | +``` |
79 | 131 |
|
80 | | -**Alternative approaches:** |
81 | | -1. **Empirical-first:** Run actual agent workloads on stateless vs. durable, measure tool call counts, and report observed overhead. Replace the theoretical model with production data. Harder to produce but more authoritative. |
82 | | -2. **Side-effect focused:** Narrow the article to side-effect duplication specifically. Title: "Agent retries duplicate side effects at scale." Drop the formula and instead catalog real failure scenarios: double-charged payments, duplicate tickets, repeated emails. More visceral for engineering readers. |
83 | | -3. **Comparative architecture:** Add a third column to the table — "stateless with manual checkpointing" — showing the engineering effort to approach durable performance without the framework. This makes the build-vs-buy argument explicit without stating it. |
| 132 | +Inspect recorded runs and their step events from the CLI:
| 133 | +
| 134 | +```bash
| 135 | +npx -y -p @workflow/cli wf inspect runs
| 136 | +```