fix: MCP tools intermittently unavailable after hibernation (#928) #996
threepointone merged 2 commits into main on Feb 27, 2026
🦋 Changeset detected. Latest commit: 24cb519. The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
Fix a race where MCP tools could be unavailable in onChatMessage after Durable Object hibernation by tracking background connection work and providing a way to await it.
- Add MCPClientManager._pendingConnections, _trackConnection(), and waitForConnections({ timeout? }) to await all in-flight connection/discovery promises (uses allSettled and optional timeout). Pending entries are cleaned when settled or removed on close.
- Agent now tracks background establishConnection promises from OAuth callback via this.mcp._trackConnection so callers can wait for those restores.
- Add waitForMcpConnections config to AIChatAgent (false by default). When enabled (true or { timeout }), AIChatAgent waits for mcp.waitForConnections() before calling onChatMessage.
- Add tests and E2E coverage (new test agents, wait-connections tests, and wrangler test bindings) to validate behavior and timeouts.
This change preserves prior non-blocking behavior by default while offering opt-in safety for callers that require MCP tools to be ready.
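A minimal sketch of the track-and-wait pattern this commit describes, assuming a standalone class. `ConnectionTracker`, `track`, `untrack` and `size` are hypothetical names for illustration and are not the real MCPClientManager members.

```typescript
// Hypothetical stand-in for the tracking part of MCPClientManager.
class ConnectionTracker {
  private pending = new Map<string, Promise<void>>();

  // Record in-flight work; the entry removes itself once the work settles.
  track(serverId: string, work: Promise<void>): void {
    const tracked: Promise<void> = work
      .catch(() => {}) // swallow failures so waiting never rejects
      .finally(() => {
        // Only delete if a newer promise has not replaced this entry.
        if (this.pending.get(serverId) === tracked) {
          this.pending.delete(serverId);
        }
      });
    this.pending.set(serverId, tracked);
  }

  // Forget a server, for example when its connection is closed.
  untrack(serverId: string): void {
    this.pending.delete(serverId);
  }

  get size(): number {
    return this.pending.size;
  }

  // Resolve once all tracked work has settled, or once the timeout elapses.
  async waitForConnections(options?: { timeout?: number }): Promise<void> {
    if (this.pending.size === 0) return;
    const settled = Promise.allSettled([...this.pending.values()]);
    const timeout = options?.timeout;
    if (timeout === undefined) {
      await settled;
      return;
    }
    let timerId: ReturnType<typeof setTimeout> | undefined;
    const timer = new Promise<void>((resolve) => {
      timerId = setTimeout(resolve, timeout);
    });
    try {
      await Promise.race([settled, timer]);
    } finally {
      clearTimeout(timerId); // no timer is left running after the race
    }
  }
}
```

Typing `timerId` as possibly undefined is one way to avoid a non-null assertion, at the cost of passing `undefined` to `clearTimeout` in a path that cannot occur in practice.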
Add end-to-end MCP server management and improve connection tracking.
- Client: introduce an MCP dropdown panel in the ai-chat UI to add/remove/authenticate MCP servers, show server/tool counts and states, and add richer assistant rendering for message parts (text, reasoning, tool states).
- Server: ChatAgent now waits for MCP connections after hibernation, configures an OAuth callback UI, exposes callable addServer/removeServer methods, and merges MCP tools into the agent toolset.
- agents/mcp: make _trackConnection private and have establishConnection self-track (delegates to a new _doEstablishConnection), and refine waitForConnections timeout behavior (0 or negative returns immediately).
- Tests: updated to use a test helper to call the private tracker and include new cases for self-tracking and timeout behavior.
- Minor cleanup: adjust Agent OAuth callback handling to use establishConnection directly.
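The self-tracking shape from this commit could look roughly like the sketch below. `SelfTrackingManager`, the injected `connect` function and `pendingCount` are hypothetical, and the positive-timeout path is deliberately left out to keep the focus on self-tracking and the zero/negative guard.

```typescript
// Hypothetical stand-in; not the real MCPClientManager.
class SelfTrackingManager {
  private pending = new Map<string, Promise<void>>();
  // Injected stand-in for the real transport and discovery work.
  private connect: (serverId: string) => Promise<void>;

  constructor(connect: (serverId: string) => Promise<void>) {
    this.connect = connect;
  }

  // Public entry point: tracks its own promise and returns it to the caller.
  establishConnection(serverId: string): Promise<void> {
    const work = this.doEstablishConnection(serverId);
    this.trackConnection(serverId, work);
    return work; // may reject; the caller decides how to handle that
  }

  private async doEstablishConnection(serverId: string): Promise<void> {
    await this.connect(serverId);
  }

  private trackConnection(serverId: string, work: Promise<void>): void {
    const tracked: Promise<void> = work
      .catch(() => {})
      .finally(() => {
        if (this.pending.get(serverId) === tracked) {
          this.pending.delete(serverId);
        }
      });
    this.pending.set(serverId, tracked);
  }

  get pendingCount(): number {
    return this.pending.size;
  }

  async waitForConnections(options?: { timeout?: number }): Promise<void> {
    const timeout = options?.timeout;
    // Zero or negative: a best-effort check that never waits.
    if (timeout !== undefined && timeout <= 0) return;
    // Positive timeouts (a timer race) are omitted from this sketch.
    await Promise.allSettled([...this.pending.values()]);
  }
}
```

Because the public method registers the promise itself, a caller such as an OAuth callback handler only needs `establishConnection()` and never touches the private tracker.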
Summary
Fixes #928: MCP tools are intermittently unavailable in onChatMessage after Durable Object hibernation.

Also updates the ai-chat example to showcase MCP server management as a real-world demo of the fix.
The problem
When an AIChatAgent wakes from hibernation, onStart() calls restoreConnectionsFromStorage(), which fires MCP server connections in the background (deliberately not awaited, to avoid blocking the DO). If a WebSocket message arrives before those connections finish, getAITools() returns an incomplete or empty tool set because the connections are still in the "connecting" state.

The user sees this as:

This is a race condition: sometimes connections finish before onChatMessage runs, sometimes they do not.

API
MCPClientManager.waitForConnections(options?)

Package: agents

New method on MCPClientManager that awaits all in-flight connection and discovery operations.

- Uses Promise.allSettled internally, so it never rejects, even if individual connections fail
- timeout: 0 or negative returns immediately (useful for "best effort" checks)
- Timeouts use Promise.race with clearTimeout cleanup, so no timers leak

AIChatAgent.waitForMcpConnections

Package: @cloudflare/ai-chat

New opt-in property on AIChatAgent that automatically waits before processing chat messages.

For lower-level control, call this.mcp.waitForConnections() directly inside onChatMessage instead.

Design decisions
1. Opt-in, not default
The wait is off by default (waitForMcpConnections = false). This preserves existing behavior: agents without MCP servers, and agents that manage timing themselves, are unaffected. Making it default-on would add latency to every chat message for all users, even those without MCP servers.

2. Track-and-wait pattern instead of blocking restore
An alternative was to await the connections directly in restoreConnectionsFromStorage. We rejected this because:

- Awaiting there would block onStart(), delaying the entire DO wake-up
- A caller can instead call waitForConnections() when it needs tools to be ready

The implementation uses a _pendingConnections Map that tracks promises with automatic cleanup:

- _trackConnection(serverId, promise): private method that wraps the promise with a .finally() handler, which removes it from the map when settled
- closeConnection() and closeAllConnections() clean up the map so waitForConnections() does not block on closed servers

3. Wait only on chat messages, not all WebSocket messages
The waitForMcpConnections wait runs only for CF_AGENT_USE_CHAT_REQUEST messages, not for state updates, RPC calls, or other WebSocket traffic. This avoids unnecessary latency on non-chat interactions after reconnect.

4. OAuth servers are excluded from tracking
Servers with auth_url set (OAuth flow in progress) are placed in the "authenticating" state and are not tracked as pending connections. They require user interaction to complete, so waiting on them would block indefinitely. The handleMcpOAuthCallback path calls establishConnection(), which self-tracks via _trackConnection internally.

5. establishConnection self-tracks

establishConnection(serverId) now tracks its own promise via _trackConnection internally and returns Promise<void> for callers to await. This means external callers (like handleMcpOAuthCallback) don't need to access the private _trackConnection method; they just call establishConnection().

Changes by file
Core fix (packages/agents)

- src/mcp/client.ts: _pendingConnections map, private _trackConnection(), waitForConnections(), establishConnection() self-tracks. Restore now tracks via _trackConnection. Cleanup in closeConnection/closeAllConnections.
- src/index.ts: handleMcpOAuthCallback calls establishConnection() (which self-tracks) and catches errors inline.

AIChatAgent integration (packages/ai-chat)

- src/index.ts: waitForMcpConnections property. Wait logic placed inside the chat-request branch of onMessage (not on all messages).

AI-chat example (examples/ai-chat)

- src/server.ts: waitForMcpConnections = true, @callable addServer/removeServer methods, MCP tools merged into onChatMessage via ...this.mcp.getAITools(), OAuth callback configuration in onStart().
- src/client.tsx: onMcpUpdate callback for reactive updates. Message parts now render in stream order (reasoning, tools, text interleaved).

Tests

- tests/mcp/client-manager.test.ts
- tests/mcp/wait-connections-e2e.test.ts: [...] onStart().
- tests/agents/wait-connections.ts: hibernationRoundTrip() and hibernationRoundTripNoWait(), which exercise onStart -> restoreConnectionsFromStorage -> waitForConnections.
- ai-chat/tests/wait-mcp-connections.test.ts: true, { timeout }, false variants all process messages correctly.
- ai-chat/tests/worker.ts + wrangler.jsonc: test agents (WaitMcpTrueAgent, WaitMcpTimeoutAgent, WaitMcpFalseAgent) + DO bindings.

Reviewer notes
- _trackConnection is private: establishConnection() self-tracks internally, so no external caller needs to access _trackConnection. The handleMcpOAuthCallback path in index.ts simply calls this.mcp.establishConnection().
- timerId! non-null assertion in waitForConnections: the Promise executor runs synchronously, so timerId is always assigned before clearTimeout. TypeScript cannot verify this. Could restructure to avoid the assertion if desired.
- The E2E tests call onStart() directly on the stub rather than going through an actual WebSocket disconnect/reconnect. This is because @cloudflare/vitest-pool-workers does not support triggering real hibernation cycles. The tests prove the onStart -> _trackConnection -> waitForConnections pipeline works, which is the critical integration seam.
- Changeset: agents (patch) and @cloudflare/ai-chat (patch).

How users fix issue #928
Before (intermittent failures):
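A self-contained model of the failing case, offered as a sketch and not as the PR's own snippet. `FakeMcp`, `restoreInBackground`, `settle` and `BeforeAgent` are hypothetical stand-ins for this.mcp and an AIChatAgent subclass, and the single `search` tool is made up.

```typescript
// Stand-in for this.mcp: tools only appear once the background connect finishes.
class FakeMcp {
  private tools: Record<string, string> = {};
  private pending: Promise<void>[] = [];

  // Models restoreConnectionsFromStorage(): starts work without awaiting it.
  restoreInBackground(delayMs: number): void {
    this.pending.push(
      new Promise<void>((resolve) => {
        setTimeout(() => {
          this.tools = { search: "ready" };
          resolve();
        }, delayMs);
      })
    );
  }

  getAITools(): Record<string, string> {
    return { ...this.tools };
  }

  // Test-only helper so the example can observe the "late" state.
  settle(): Promise<unknown> {
    return Promise.allSettled(this.pending);
  }
}

class BeforeAgent {
  mcp = new FakeMcp();

  onStart(): void {
    this.mcp.restoreInBackground(10);
  }

  // Reads tools immediately, so the result depends on timing.
  async onChatMessage(): Promise<string[]> {
    return Object.keys(this.mcp.getAITools());
  }
}
```

A message handled right after `onStart()` sees no tools, while the same call made after the background work settles sees them, which is the intermittent behavior described in the problem statement.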
After (option A — declarative):
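With the real package, the declarative option amounts to setting waitForMcpConnections to true (or { timeout }) on the AIChatAgent subclass. The sketch below models that with stand-ins so it runs on its own: `FakeMcp`, `SketchChatAgent` and `handleChatRequest` are hypothetical, and the flag is simplified to a boolean.

```typescript
// Stand-in for this.mcp with a background restore and a wait method.
class FakeMcp {
  private tools: Record<string, string> = {};
  private pending: Promise<void>[] = [];

  restoreInBackground(delayMs: number): void {
    this.pending.push(
      new Promise<void>((resolve) => {
        setTimeout(() => {
          this.tools = { search: "ready" };
          resolve();
        }, delayMs);
      })
    );
  }

  getAITools(): Record<string, string> {
    return { ...this.tools };
  }

  async waitForConnections(): Promise<void> {
    await Promise.allSettled(this.pending);
  }
}

// Stand-in base class modelling the opt-in flag (boolean only in this sketch).
class SketchChatAgent {
  mcp = new FakeMcp();
  waitForMcpConnections = false; // off by default, as in the PR

  onStart(): void {
    this.mcp.restoreInBackground(10);
  }

  async onChatMessage(): Promise<string[]> {
    return Object.keys(this.mcp.getAITools());
  }

  // Models the chat-request branch of onMessage: wait first if opted in.
  async handleChatRequest(): Promise<string[]> {
    if (this.waitForMcpConnections) {
      await this.mcp.waitForConnections();
    }
    return this.onChatMessage();
  }
}

// The user-facing change is a single property.
class DeclarativeAgent extends SketchChatAgent {
  waitForMcpConnections = true;
}
```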
After (option B — imperative):
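For the imperative option, the wait moves into onChatMessage itself. This is a sketch under the same assumptions as a real agent would face, with `FakeMcp` and `ImperativeAgent` as hypothetical stand-ins. Only the `this.mcp.waitForConnections()` call mirrors the API described in this PR.

```typescript
// Stand-in for this.mcp with a background restore and a wait method.
class FakeMcp {
  private tools: Record<string, string> = {};
  private pending: Promise<void>[] = [];

  restoreInBackground(delayMs: number): void {
    this.pending.push(
      new Promise<void>((resolve) => {
        setTimeout(() => {
          this.tools = { search: "ready" };
          resolve();
        }, delayMs);
      })
    );
  }

  getAITools(): Record<string, string> {
    return { ...this.tools };
  }

  async waitForConnections(): Promise<void> {
    await Promise.allSettled(this.pending);
  }
}

class ImperativeAgent {
  mcp = new FakeMcp();

  onStart(): void {
    this.mcp.restoreInBackground(10);
  }

  async onChatMessage(): Promise<string[]> {
    // Wait explicitly, at the point where tools are actually needed.
    await this.mcp.waitForConnections();
    return Object.keys(this.mcp.getAITools());
  }
}
```

This form suits agents that only need tools on some code paths, since the wait can be placed behind whatever condition decides that tools are required.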