fix: MCP tools intermittently unavailable after hibernation (#928) #996

Merged
threepointone merged 2 commits into main from chat-wait-for-mcp
Feb 27, 2026

threepointone commented Feb 26, 2026

Summary

Fixes #928 — MCP tools are intermittently unavailable in onChatMessage after Durable Object hibernation.

Also updates the ai-chat example to showcase MCP server management as a real-world demo of the fix.

The problem

When an AIChatAgent wakes from hibernation, onStart() calls restoreConnectionsFromStorage() which fires MCP server connections in the background (deliberately not awaited, to avoid blocking the DO). If a WebSocket message arrives before those connections finish, getAITools() returns an incomplete or empty tool set because the connections are still in "connecting" state.

The user sees this as:

[getAITools] WARNING: Reading tools from connection aik32tRf in state "connecting". Tools may not be loaded yet.

This is a race condition: sometimes connections finish before onChatMessage runs, sometimes they do not.
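The race can be reproduced in isolation. The sketch below is a hypothetical simulation, not code from the agents package: FakeManager, restore, getTools, and demoRace are invented names that stand in for the restore path and the tool lookup.

```typescript
// Hypothetical simulation of the race; none of these names come from the agents package.
type ConnState = "connecting" | "ready";

class FakeManager {
  state: ConnState = "connecting";
  pending: Promise<void> | undefined;
  private tools: string[] = [];

  // Mirrors the restore path: the connection is started but deliberately not awaited.
  restore(delayMs: number): void {
    this.pending = new Promise<void>((resolve) => {
      setTimeout(() => {
        this.tools = ["search", "fetch"];
        this.state = "ready";
        resolve();
      }, delayMs);
    });
  }

  // Returns whatever has been discovered so far, even while still "connecting".
  getTools(): string[] {
    return [...this.tools];
  }
}

async function demoRace(): Promise<{ early: string[]; late: string[] }> {
  const manager = new FakeManager();
  manager.restore(20);
  const early = manager.getTools(); // a message arrived before the connection finished
  await manager.pending;
  const late = manager.getTools(); // the same call after the connection settled
  return { early, late };
}
```

Whether a real message lands in the "early" or the "late" window depends only on timing, which is why the failure is intermittent.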


API

MCPClientManager.waitForConnections(options?)

Package: agents

New method on MCPClientManager that awaits all in-flight connection and discovery operations.

// Wait indefinitely for all connections to settle
await this.mcp.waitForConnections();

// Wait up to 10 seconds, then proceed regardless
await this.mcp.waitForConnections({ timeout: 10_000 });

// Return immediately with whatever is ready (timeout: 0 or negative)
await this.mcp.waitForConnections({ timeout: 0 });

  • Uses Promise.allSettled internally, so it never rejects, even if individual connections fail
  • Returns once every pending connection has either connected and finished tool discovery, failed, or timed out
  • Resolves immediately if there are no pending connections
  • timeout: 0 or negative returns immediately (useful for "best effort" checks)
  • Safe to call concurrently from multiple callers (each snapshots the same pending promises)
  • Timeout uses Promise.race with clearTimeout cleanup — no leaked timers
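The semantics listed above can be approximated in a few lines. This is a minimal sketch under stated assumptions, not the code in src/mcp/client.ts: waitForAll is a hypothetical free function, and the map of pending promises is passed in explicitly instead of living on the manager.

```typescript
// A sketch under stated assumptions; waitForAll is a hypothetical stand-in,
// not the actual MCPClientManager.waitForConnections implementation.
async function waitForAll(
  pending: Map<string, Promise<unknown>>,
  options?: { timeout?: number }
): Promise<void> {
  if (pending.size === 0) return; // nothing in flight, resolve immediately

  const timeout = options?.timeout;
  if (timeout !== undefined && timeout <= 0) return; // best effort: do not wait

  // Snapshot the current promises; allSettled never rejects.
  const settled: Promise<void> = Promise.allSettled(
    Array.from(pending.values())
  ).then(() => undefined);

  if (timeout === undefined) {
    await settled; // wait indefinitely
    return;
  }

  let timerId: ReturnType<typeof setTimeout> | undefined;
  const timer = new Promise<void>((resolve) => {
    timerId = setTimeout(resolve, timeout);
  });
  try {
    await Promise.race([settled, timer]);
  } finally {
    clearTimeout(timerId); // no leaked timer when connections settle first
  }
}
```

Typing the timer handle as possibly undefined is one way to avoid a non-null assertion, since clearTimeout accepts undefined.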

AIChatAgent.waitForMcpConnections

Package: @cloudflare/ai-chat

New opt-in property on AIChatAgent that automatically waits before processing chat messages.

class MyAgent extends AIChatAgent<Env> {
  // Pick exactly one of the following declarations.

  // Wait indefinitely for all MCP connections before onChatMessage
  waitForMcpConnections = true;

  // Or: wait up to 10 seconds
  // waitForMcpConnections = { timeout: 10_000 };

  // Default: false (non-blocking, existing behavior preserved)
  // waitForMcpConnections = false;
}

For lower-level control, call this.mcp.waitForConnections() directly inside onChatMessage instead.


Design decisions

1. Opt-in, not default

The wait is off by default (waitForMcpConnections = false). This preserves existing behavior — agents without MCP servers or agents that manage timing themselves are unaffected. Making it default-on would add latency to every chat message for all users, even those without MCP servers.

2. Track-and-wait pattern instead of blocking restore

An alternative was to await the connections directly in restoreConnectionsFromStorage. We rejected this because:

  • It would block onStart(), delaying the entire DO wake-up
  • Multiple callers might need the connections at different times
  • The track-and-wait pattern is more composable — any code path can call waitForConnections() when it needs tools to be ready

The implementation uses a _pendingConnections Map that tracks promises with automatic cleanup:

  • _trackConnection(serverId, promise) — private method that wraps the promise with .finally() that removes it from the map when settled
  • If a server is re-tracked (e.g., reconnect), the old promise's cleanup checks identity before deleting, preventing a newer promise from being orphaned
  • closeConnection() and closeAllConnections() clean up the map so waitForConnections() does not block on closed servers
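The bookkeeping described above, including the identity check on cleanup, might look roughly like this. It is a hypothetical sketch: PendingTracker, track, and close are illustrative names, and the real _trackConnection may differ in detail.

```typescript
// Hypothetical sketch of the track-and-wait bookkeeping; names are illustrative.
class PendingTracker {
  readonly pending = new Map<string, Promise<void>>();

  track(serverId: string, work: Promise<unknown>): Promise<void> {
    // Swallow rejections so a failed connection never rejects a waiter.
    const tracked: Promise<void> = work
      .then(
        () => undefined,
        () => undefined
      )
      .finally(() => {
        // Identity check: only delete if this promise is still the current entry.
        if (this.pending.get(serverId) === tracked) {
          this.pending.delete(serverId);
        }
      });
    this.pending.set(serverId, tracked);
    return tracked;
  }

  close(serverId: string): void {
    // Waiters that start later should not block on a closed server.
    this.pending.delete(serverId);
  }
}
```

Without the identity check, an older promise settling after a reconnect would delete the newer entry, and a later waiter would return before the reconnect finished.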

3. Wait only on chat messages, not all WebSocket messages

The waitForMcpConnections wait runs only for CF_AGENT_USE_CHAT_REQUEST messages, not for state updates, RPC calls, or other WebSocket traffic. This avoids unnecessary latency on non-chat interactions after reconnect.
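As an illustration of where the wait sits, here is a hypothetical dispatcher. The message-type string, the handleMessage name, and the log parameter are all assumptions made for the sketch; the real logic lives in the chat-request branch of onMessage in packages/ai-chat.

```typescript
// Hypothetical dispatcher; the message-type value and all names are assumptions.
const CHAT_REQUEST = "cf_agent_use_chat_request"; // assumed value, for illustration only

type WaitConfig = boolean | { timeout: number };

async function handleMessage(
  raw: string,
  waitConfig: WaitConfig,
  waitForConnections: (options?: { timeout?: number }) => Promise<void>,
  log: string[]
): Promise<void> {
  const message = JSON.parse(raw) as { type?: string };
  if (message.type !== CHAT_REQUEST) {
    // State updates, RPC calls, and other traffic skip the wait entirely.
    log.push(`passthrough:${message.type}`);
    return;
  }
  if (waitConfig) {
    // true means wait indefinitely; an object carries the timeout through.
    await waitForConnections(waitConfig === true ? undefined : waitConfig);
  }
  log.push("chat"); // stands in for the call to onChatMessage
}
```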

4. OAuth servers are excluded from tracking

Servers with auth_url set (OAuth flow in progress) are placed in "authenticating" state and are not tracked as pending connections. They require user interaction to complete, so waiting on them would block indefinitely. The handleMcpOAuthCallback path calls establishConnection() which self-tracks via _trackConnection internally.
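A restore loop with this exclusion could be sketched as follows. The StoredServer row shape, restoreServers, and the connect callback are hypothetical; only the auth_url field and the "authenticating" and "connecting" state names come from the description above.

```typescript
// Hypothetical restore loop; the row shape and function names are assumptions.
interface StoredServer {
  id: string;
  auth_url: string | null; // set while an OAuth flow is in progress
}

function restoreServers(
  rows: StoredServer[],
  connect: (id: string) => Promise<void>
): { states: Map<string, string>; pending: Map<string, Promise<void>> } {
  const states = new Map<string, string>();
  const pending = new Map<string, Promise<void>>();
  for (const row of rows) {
    if (row.auth_url) {
      // Needs user interaction, so it is never tracked as pending.
      states.set(row.id, "authenticating");
      continue;
    }
    states.set(row.id, "connecting");
    // Started in the background; failures are swallowed so waiters never reject.
    pending.set(row.id, connect(row.id).catch(() => undefined));
  }
  return { states, pending };
}
```

A waiter that only looks at the pending map therefore never blocks on a server that is stuck behind an authorization prompt.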

5. establishConnection self-tracks

establishConnection(serverId) now tracks its own promise via _trackConnection internally and returns Promise<void> for callers to await. This means external callers (like handleMcpOAuthCallback) don't need to access the private _trackConnection method — they just call establishConnection().


Changes by file

Core fix (packages/agents)

File | Change
src/mcp/client.ts | Added _pendingConnections map, private _trackConnection(), waitForConnections(), establishConnection() self-tracks. Restore now tracks via _trackConnection. Cleanup in closeConnection/closeAllConnections.
src/index.ts | handleMcpOAuthCallback calls establishConnection() (which self-tracks) and catches errors inline.

AIChatAgent integration (packages/ai-chat)

File | Change
src/index.ts | Added waitForMcpConnections property. Wait logic placed inside the chat-request branch of onMessage (not on all messages).

AI-chat example (examples/ai-chat)

File | Change
src/server.ts | Added waitForMcpConnections = true, @callable addServer/removeServer methods, MCP tools merged into onChatMessage via ...this.mcp.getAITools(), OAuth callback configuration in onStart().
src/client.tsx | Added MCP server management dropdown in header (add/remove servers, OAuth authorize button, server state badges, tool count). onMcpUpdate callback for reactive updates. Message parts now render in stream order (reasoning, tools, text interleaved).

Tests

File | What it tests
tests/mcp/client-manager.test.ts | Unit tests: immediate resolve, tracked settle, mixed success/failure, cleanup, timeout, timeout=0/negative, early finish, concurrent callers, promise replacement identity, establishConnection self-tracking.
tests/mcp/wait-connections-e2e.test.ts | 8 E2E tests against real DO stubs with SQLite: no-servers, restore-wait, race-condition demo, OAuth skip, timeout, 3 true hibernation round-trip tests through onStart().
tests/agents/wait-connections.ts | Test agent with hibernationRoundTrip() and hibernationRoundTripNoWait() that exercise onStart -> restoreConnectionsFromStorage -> waitForConnections.
ai-chat/tests/wait-mcp-connections.test.ts | 3 config plumbing tests: true, { timeout }, false variants all process messages correctly.
ai-chat/tests/worker.ts + wrangler.jsonc | 3 new test agents (WaitMcpTrueAgent, WaitMcpTimeoutAgent, WaitMcpFalseAgent) + DO bindings.

Reviewer notes

  • _trackConnection is private: establishConnection() self-tracks internally, so no external caller needs to access _trackConnection. The handleMcpOAuthCallback path in index.ts simply calls this.mcp.establishConnection().
  • timerId! non-null assertion in waitForConnections — the Promise constructor runs synchronously so timerId is always assigned before clearTimeout. TypeScript cannot verify this. Could restructure to avoid the assertion if desired.
  • The hibernation round-trip E2E tests use onStart() directly on the stub rather than going through actual WebSocket disconnect/reconnect. This is because @cloudflare/vitest-pool-workers does not support triggering real hibernation cycles. The tests prove the onStart -> _trackConnection -> waitForConnections pipeline works, which is the critical integration seam.
  • The changeset covers both agents (patch) and @cloudflare/ai-chat (patch).

How users fix issue #928

Before (intermittent failures):

async onChatMessage(onFinish, options) {
  const mcpTools = this.mcp.getAITools(); // <- sometimes empty after hibernation
}

After (option A — declarative):

class MyAgent extends AIChatAgent<Env> {
  waitForMcpConnections = true; // or { timeout: 10_000 }

  async onChatMessage(onFinish, options) {
    const mcpTools = this.mcp.getAITools(); // <- always complete
  }
}

After (option B — imperative):

async onChatMessage(onFinish, options) {
  await this.mcp.waitForConnections();
  const mcpTools = this.mcp.getAITools(); // <- always complete
}

changeset-bot bot commented Feb 26, 2026

🦋 Changeset detected

Latest commit: 24cb519

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name | Type
agents | Patch
@cloudflare/ai-chat | Patch

pkg-pr-new bot commented Feb 26, 2026

npm i https://pkg.pr.new/cloudflare/agents@996
npm i https://pkg.pr.new/cloudflare/agents/@cloudflare/ai-chat@996
npm i https://pkg.pr.new/cloudflare/agents/@cloudflare/codemode@996
npm i https://pkg.pr.new/cloudflare/agents/hono-agents@996

commit: 24cb519

Fix a race where MCP tools could be unavailable in onChatMessage after Durable Object hibernation by tracking background connection work and providing a way to await it.

- Add MCPClientManager._pendingConnections, _trackConnection(), and waitForConnections({ timeout? }) to await all in-flight connection/discovery promises (uses allSettled and optional timeout). Pending entries are cleaned when settled or removed on close.
- Agent now tracks background establishConnection promises from OAuth callback via this.mcp._trackConnection so callers can wait for those restores.
- Add waitForMcpConnections config to AIChatAgent (false by default). When enabled (true or { timeout }), AIChatAgent waits for mcp.waitForConnections() before calling onChatMessage.
- Add tests and E2E coverage (new test agents, wait-connections tests, and wrangler test bindings) to validate behavior and timeouts.

This change preserves prior non-blocking behavior by default while offering opt-in safety for callers that require MCP tools to be ready.

Add end-to-end MCP server management and improve connection tracking.

- Client: introduce an MCP dropdown panel in the ai-chat UI to add/remove/authenticate MCP servers, show server/tool counts and states, and add richer assistant rendering for message parts (text, reasoning, tool states).
- Server: ChatAgent now waits for MCP connections after hibernation, configures an OAuth callback UI, exposes callable addServer/removeServer methods, and merges MCP tools into the agent toolset.
- agents/mcp: make _trackConnection private and have establishConnection self-track (delegates to a new _doEstablishConnection), and refine waitForConnections timeout behavior (0 or negative returns immediately).
- Tests: updated to use a test helper to call the private tracker and include new cases for self-tracking and timeout behavior.
- Minor cleanup: adjust Agent OAuth callback handling to use establishConnection directly.
threepointone merged commit baf6751 into main Feb 27, 2026
4 checks passed
threepointone deleted the chat-wait-for-mcp branch February 27, 2026 15:36
github-actions bot mentioned this pull request Feb 27, 2026
Development

Successfully merging this pull request may close these issues.

MCP tools intermittently unavailable in onChatMessage
