Skip to content

Commit 43aa564

fix: Make dev server resilient to dependency re-optimization (#832)
This addresses two distinct but related sources of instability in the Vite dev server, both of which are triggered by Vite's dependency re-optimization process.

## Problem 1: Module-Level State is Discarded on Re-optimization

The framework's runtime relies on long-lived, module-level state for critical features like request context tracking via `AsyncLocalStorage`. However, Vite's dependency re-optimization process, designed for browser-based hot reloading, is fundamentally incompatible with this. When a new dependency is discovered, Vite discards and re-instantiates the entire module graph. This wipes out our module-level state, leading to unpredictable runtime errors and application crashes.

### Solution: A Virtual State Module

A virtual state module, `rwsdk/__state`, is introduced to act as a centralized, persistent store for framework-level state.

- A new Vite plugin (`statePlugin`) marks this virtual module as `external` to Vite's dependency optimizer for the worker environment. This insulates it from the re-optimization and reload process.
- The plugin resolves `rwsdk/__state` to a physical module (the built output in `dist/` for `sdk/src/runtime/state.ts`) that contains the state container and management APIs. In other words, the dep optimizer is bypassed for this specific module, giving it a stable path outside the dep-optimization bundles.
- Framework code is refactored to use this module (e.g., `defineRwState(...)`), making the state resilient to reloads.

This solves the state-loss problem and centralizes state management within the framework.

## Problem 2: Race Conditions Cause "Stale Pre-bundle" Errors

In a standard Vite setup, handling stale dependencies is routine. When a re-optimization occurs, the browser might request a module with an old version hash. The Vite server correctly throws a "stale pre-bundle" error, which is caught by Vite's client-side script in the browser. This script then automatically retries the request or performs a full page reload, seamlessly recovering from the transient error.

However, our architecture introduces several layers of complexity that make this standard recovery model insufficient. The "client" making these requests is not a browser, but the Cloudflare `CustomModuleRunner` executing server-side within Miniflare. Furthermore, our **SSR Bridge** architecture means this runner interacts with a virtual module subgraph. When it needs to render an SSR component, it requests a virtual module which, via our plugin, triggers a server-to-server `fetchModule` call from the `worker` environment to the isolated `ssr` environment.

This unique, cross-environment request pattern for virtual modules is at the heart of the instability. When a re-optimization happens in the `ssr` environment, the standard recovery mechanisms are not equipped to handle the resulting state desynchronization. The failure manifests as a perfect storm of three deeper, interconnected issues:

1. **Stale Resolution:** After an SSR re-optimization, Vite's internal resolver would continue to use a stale "ghost node" from its module graph to resolve our virtual `ssr_bridge` module, leading to a request for a dependency with an old, invalid version hash.
2. **Desynchronized Environments:** The `full-reload` HMR event triggered by the SSR optimizer was not being propagated to the worker environment. This meant the worker's own caches (especially the `CustomModuleRunner`'s execution cache) were never cleared and continued to use stale modules, creating an infinite error loop.
3. **Premature Re-import:** Even with synchronized invalidation, the worker's module runner re-imports its entry points immediately after clearing its cache. This happens too quickly, hitting the Vite server while its own internal state is still being updated, re-triggering the stale dependency error.

### Solution: A Multi-Layered Approach to Synchronization and Stability

A combination of fixes addresses this race condition:

1. **Manual Hash Resolution:** The `ssrBridgePlugin` no longer relies on Vite's internal, faulty resolution for virtual modules. It now manually resolves the correct, up-to-date version hash for any optimized dependency from the SSR optimizer's metadata before fetching it. This bypasses the "ghost node" problem.
2. **HMR Propagation:** The `ssrBridgePlugin` now intercepts the `full-reload` HMR event from the SSR environment and propagates it to the worker environment. This ensures the worker's module runner and module graph are correctly invalidated when the SSR environment changes.
3. **Debounced Stability Plugin (`staleDepRetryPlugin`):** A new error-handling middleware catches the inevitable "stale pre-bundle" error from the runner's premature re-import. It does not immediately retry; instead, it waits for the server to become "stable" by monitoring the `transform` hook. Once a quiet period with no module transformation activity is detected, it signals the client to perform a full reload and gracefully redirects the failed request.
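To make the state-module idea concrete, here is a minimal TypeScript sketch of what a persistent state container behind `rwsdk/__state` could look like. The `defineRwState` signature, the internal `Map`, and the `requestContextStorage` example are assumptions for illustration only; the real implementation lives in `sdk/src/runtime/state.ts` and may differ.

```ts
import { AsyncLocalStorage } from "node:async_hooks";

// Illustrative sketch only: a keyed store for framework state. Because the
// module behind rwsdk/__state is kept external to Vite's dep optimizer, it is
// not discarded on re-optimization, so this map survives reloads.
const store = new Map<string, unknown>();

// Modules that *are* reloaded re-run their defineRwState() calls; the keyed
// lookup returns the existing value instead of creating a fresh one.
export function defineRwState<T>(key: string, init: () => T): T {
  if (!store.has(key)) {
    store.set(key, init());
  }
  return store.get(key) as T;
}

// Hypothetical consumer: request-context tracking keeps the same
// AsyncLocalStorage instance across re-optimizations.
export const requestContextStorage = defineRwState(
  "requestContextStorage",
  () => new AsyncLocalStorage<Map<string, unknown>>(),
);
```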
1 parent 4b9a483 commit 43aa564


53 files changed: +12799 −1107 lines

.github/workflows/playground-e2e-tests.yml

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ on:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request_target' }}

env:
  CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}

.github/workflows/smoke-test.yml

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ on:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request_target' }}

env:
  CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@

.rwsync.lock

+.tmp/
+
logs
*.log
npm-debug.log*

.notes/justin/worklogs/2025-10-13-resilient-module-state-in-dev.md

Lines changed: 1425 additions & 0 deletions
Large diffs are not rendered by default.

CONTRIBUTING.md

Lines changed: 11 additions & 0 deletions
@@ -187,6 +187,17 @@ You can also specify a package manager or enable debug logging using environment
PACKAGE_MANAGER="yarn" DEBUG='rwsdk:e2e:environment' pnpm test:e2e hello-world/__tests__/e2e.test.mts
```

+#### Local Development Performance
+
+To speed up the local test-and-debug cycle, the E2E test harness uses a caching mechanism that is **enabled by default** for local runs.
+
+- **How it Works**: The harness creates a persistent test environment in your system's temporary directory for each playground project. On the first run, it installs all dependencies. On subsequent runs, it reuses this environment, skipping the lengthy installation step. The cache is automatically disabled in CI environments.
+- **Disabling the Cache**: If you need to force a clean install, you can disable the cache by setting the `RWSDK_E2E_CACHE` environment variable to `0`:
+  ```sh
+  RWSDK_E2E_CACHE=0 pnpm test:e2e
+  ```
+- **Cache Invalidation**: If you change a playground's `package.json`, you will need to manually clear the cache for that playground to force a re-installation. The cache directories are located in your system's temporary folder (e.g., `/tmp/rwsdk-e2e-cache` on Linux).
+
#### Skipping Tests

You can skip dev server or deployment tests using environment variables. This is useful for focusing on a specific part of the test suite.
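For orientation, the caching rule described in the diff above could be expressed roughly as follows. The function name and directory layout are assumptions for illustration, not the harness's actual code.

```ts
import os from "node:os";
import path from "node:path";

// Sketch of the cache decision: reuse a per-playground environment under the
// OS temp dir, unless caching is disabled explicitly (RWSDK_E2E_CACHE=0) or
// implicitly (running in CI).
export function resolveE2eCacheDir(playgroundName: string): string | null {
  const disabled = process.env.RWSDK_E2E_CACHE === "0" || !!process.env.CI;
  if (disabled) {
    return null; // force a clean install on every run
  }
  return path.join(os.tmpdir(), "rwsdk-e2e-cache", playgroundName);
}
```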

docs/architecture/devServerDependencyOptimization.md

Lines changed: 18 additions & 8 deletions
@@ -24,27 +24,31 @@ The sequence of events was as follows:

This happened because the initial optimization pass was only aware of third-party `node_modules` dependencies; it had no knowledge of the application's internal dependency graph.

-## The Solution: A Unified, Proactive Scan
+## The Solutions

-The solution is a unified strategy that proactively scans the *entire* dependency graph—both third-party and application code—and feeds this complete picture to Vite at startup. This solves both problems at once by ensuring all dependencies are discovered before they are needed, eliminating both request waterfalls and re-optimization triggers.
+The solution is a two-pronged strategy. First, a proactive dependency scan solves the performance problem and reduces the frequency of re-optimizations. Second, a virtual state module provides true resilience against the state loss that occurs when a re-optimization is unavoidable.

-This strategy has three main parts.
+### Solution 1: Proactive Scanning to Prevent Waterfalls and Reduce Re-Optimizations

-### 1. A Standalone `esbuild` Scan
+The first part of the solution is a unified strategy that proactively scans the *entire* dependency graph—both third-party and application code—and feeds this complete picture to Vite at startup. This solves the performance problem by ensuring all dependencies are discovered before they are needed, eliminating request waterfalls. It also mitigates the stability problem by making re-optimizations much less frequent, as the optimizer has a more complete picture of the graph from the outset.

-The core of the solution is our own, separate `esbuild` scan that runs before Vite's `optimizeDeps` process begins. This scan traverses the application's entire dependency graph to create a definitive list of all modules.
+However, this proactive scan cannot account for dependencies that are truly new, such as when a developer adds an import to a new package or module mid-session. When this happens, a re-optimization is still triggered, which leads to the second part of the solution.
+
+#### 1. A Standalone `esbuild` Scan
+
+The core of this strategy is our own, separate `esbuild` scan that runs before Vite's `optimizeDeps` process begins. This scan traverses the application's entire dependency graph to create a definitive list of all modules.

The scanner's most critical feature is its custom, Vite-aware module resolver, which ensures its dependency traversal perfectly mimics the application's actual runtime behavior, correctly handling complex project configurations like TypeScript path aliases.

For a detailed explanation of the scanner's implementation and the rationale behind its design, see the [Directive Scanning and Module Resolution](./directiveScanningAndResolution.md) documentation.

-### 2. The "Barrel File" Strategy to Inform the Optimizer
+#### 2. The "Barrel File" Strategy to Inform the Optimizer

Instead of feeding hundreds of individual files to `optimizeDeps`, we consolidate them into **"barrel files."** We create separate barrels for third-party dependencies (which we refer to as **vendor barrels**) and for the application's own source code.

This approach works *with* the bundler's expectations. By providing a small, consolidated list of entry points (the barrel files), we signal a complete and interconnected dependency graph. This allows `esbuild` to perform an efficient, comprehensive optimization pass that avoids both excessive chunking and the need for later re-optimization.

-### 3. Synchronized Execution and Assertive Resolution
+#### 3. Synchronized Execution and Assertive Resolution

A final challenge is the timing and execution of this process within Vite's lifecycle. Vite starts many processes in parallel, creating potential race conditions. Furthermore, Vite's dependency scanner is designed to treat application code as "external" by default, meaning it won't scan it for dependencies.

@@ -56,4 +60,10 @@ We solve this with a hybrid blocking and resolution strategy:

3. **Assertive Resolution:** The same `esbuild` plugin intercepts resolution requests for the application's own source files. It then explicitly returns a resolution result, claiming the file and signaling that it is *internal* code that must be scanned for dependencies. This preempts Vite's default behavior and ensures the entire application graph is traversed.

-This approach provides a stable and performant development environment by ensuring Vite has a complete dependency graph from the outset, balancing perceived startup performance with the correctness required to prevent disruptive re-optimizations.
+### Solution 2: A Virtual State Module for Resilient State
+
+To solve the module state loss problem definitively, the framework introduces a centralized, virtual state module that is insulated from Vite's re-optimization process. This module, identified by the specifier `rwsdk/__state`, acts as the single, persistent source of truth for all critical framework-level state.
+
+A dedicated Vite plugin is responsible for managing this module. Its primary job is to mark `rwsdk/__state` as "external" to Vite's dependency optimizer for the `worker` environment. This simple but critical step prevents the state module from being included in the dependency graph that Vite reloads. When a re-optimization occurs, all other application and framework modules are re-instantiated, but the virtual state module remains untouched, preserving its state across the reload.
+
+This approach directly solves the state-loss problem, making features that rely on module-level state (like `AsyncLocalStorage` for request context) resilient to dependency changes during development. It also encourages a more organized approach to state management within the framework by providing a central, explicit location for all shared state.
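As a rough illustration of the plugin described in Solution 2 of this doc, the sketch below excludes `rwsdk/__state` from the worker environment's dep optimizer and resolves it to a stable path on disk. The plugin factory, the option shape, and the `dist` path are assumptions; the actual `statePlugin` may be wired differently.

```ts
import path from "node:path";
import type { Plugin } from "vite";

// Illustrative sketch, not the real statePlugin.
export function statePlugin(opts: { sdkDistDir: string }): Plugin {
  const STATE_SPECIFIER = "rwsdk/__state";
  // Assumed location of the built output for sdk/src/runtime/state.ts.
  const statePath = path.join(opts.sdkDistDir, "runtime", "state.js");

  return {
    name: "rwsdk:state",
    config() {
      return {
        environments: {
          worker: {
            optimizeDeps: {
              // Keep the state module out of the optimizer's bundles so it is
              // not re-instantiated when those bundles are discarded.
              exclude: [STATE_SPECIFIER],
            },
          },
        },
      };
    },
    resolveId(source) {
      if (source === STATE_SPECIFIER) {
        // Resolve to a stable physical module outside the optimized deps.
        return statePath;
      }
    },
  };
}
```

Framework modules then import their state through `rwsdk/__state` (e.g., via `defineRwState`), so the values they hold outlive re-optimization reloads.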
docs/architecture/devServerStability.md

Lines changed: 45 additions & 0 deletions

@@ -0,0 +1,45 @@
+# Architecture: Dev Server Stability
+
+## The Challenge: Unstable Server Renders on Re-optimization
+
+Vite's dependency re-optimization is a core feature of its development server. When a new, previously undiscovered import is added to the codebase, Vite automatically pre-bundles it and reloads the browser to ensure a consistent state. This process is generally seamless for client-side code, as Vite's client script handles the `full-reload` Hot Module Replacement (HMR) event gracefully.
+
+However, this standard recovery model does not apply to code executing on the server, specifically within our framework's `worker` environment. The "client" in this context is not a browser, but the Cloudflare `CustomModuleRunner` executing inside Miniflare. This server-side runner does not have the same built-in recovery logic as a browser.
+
+When a re-optimization is triggered by a module used during a server render (e.g., inside an SSR'd component or a server action), the system enters an unstable state. Without a robust recovery mechanism designed for this server-side context, this can lead to crashes, hangs, and a frustrating developer experience. The challenge, therefore, is to create a recovery system that makes re-optimization events as seamless for our server environment as they are for a browser.
+
+## The Solution: A Server-Side Recovery System
+
+To solve this, the framework implements a multi-layered system that creates a robust recovery process for the server environment. Building this system required overcoming several technical hurdles that arise from our use of two interconnected Vite environments (`worker` and `ssr`).
+
+### Hurdle 1: Stale Resolution from Cached Module Nodes
+
+Vite's module graph caches a representation of every processed module in a `ModuleNode` object. When a re-optimization occurs, Vite's standard invalidation process sets a flag on these nodes but does not fully remove them, leaving behind a "ghost node". This ghost node retains some old information, including the module's previously resolved ID (e.g., a path with an old version hash).
+
+This creates a problem for our SSR Bridge. When the bridge requests a module by its clean, un-hashed name, Vite's resolver can find this ghost node and, as a shortcut, re-use its stale, version-hashed ID instead of performing a fresh resolution. This leads to a request for an outdated dependency.
+
+**Solution:** The `ssrBridgePlugin` employs **Proactive Hash Resolution**. It avoids this faulty lookup by not relying on Vite's internal resolver for virtual modules. Instead, it proactively determines the correct, up-to-date version hash for any optimized dependency by looking directly at the SSR optimizer's metadata.
+
+### Hurdle 2: Desynchronized Environment Caches
+
+The `worker` and `ssr` environments are isolated; by default, an HMR event in one does not affect the other. This architectural separation becomes a problem during re-optimization. If the `ssr` environment re-optimizes and resets its state, the `worker` environment remains unaware, leaving its own caches (both Vite's module graph and the Cloudflare runner's execution cache) in a stale and inconsistent state.
+
+**Solution:** The `ssrBridgePlugin` is responsible for **Cross-Environment HMR Propagation**. It bridges this gap by intercepting `full-reload` events from the SSR environment's HMR channel and forwarding them to the worker's channel. This ensures that when the `ssr` environment resets, the `worker` environment is also instructed to invalidate its caches in lockstep.
+
+### Hurdle 3: Race Conditions on Re-import
+
+The `CustomModuleRunner` is designed to re-import its entry points immediately after receiving a `full-reload` event. This happens too quickly, hitting the Vite server before it has finished stabilizing, which re-triggers a "stale pre-bundle" error. This necessitates a final safeguard that can gracefully handle this predictable race condition.
+
+### The Debounced Redirect-and-Retry Mechanism
+
+The solution is a final safeguard in the form of an error-handling middleware (`staleDepRetryPlugin`) that performs a **Debounced Retry**. When it catches the predictable "stale pre-bundle" error, it does not immediately retry. Instead, it waits for the server to become "stable" by monitoring Vite's `transform` hook for a period of inactivity.
+
+Once the server is stable, it performs two actions:
+1. **Triggers a client-side reload:** A `full-reload` HMR message is sent to the browser.
+2. **Redirects the failed request:** It responds to the original request with a `307 Temporary Redirect`.
+
+This redirect was chosen over a transparent, server-side retry for two key reasons:
+1. **Technical Feasibility:** A transparent retry for requests with bodies (e.g., `POST` for server actions) is not possible without buffering the request body in advance, an approach that was rejected for performance and dev/prod parity reasons.
+2. **Architectural Safety:** Transparently retrying `POST` requests is risky, as it could cause non-idempotent actions to execute twice.
+
+The `307` redirect forces the client to re-issue the request against a now-stable server. This makes it a simple and universal recovery mechanism that handles all types of requests consistently, whether the original request was for a full HTML document (for pages with or without client-side JS), a `fetch` request from a client-side interaction, or a non-browser request from within the worker itself. While this can make the first client-side interaction appear to be a no-op (when that interaction is the one that triggers a re-optimization), its robustness and simplicity make it the most pragmatic choice.
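To make the recovery flow concrete, here is a hedged TypeScript sketch of what a debounced error-handling middleware along these lines could look like. The quiet-period value, the error-matching heuristic, and the use of `server.ws.send` are assumptions for illustration; the real `staleDepRetryPlugin` may detect and signal the condition differently.

```ts
import type { Plugin, ViteDevServer } from "vite";

export function staleDepRetryPlugin(): Plugin {
  const QUIET_PERIOD_MS = 500; // assumed debounce window
  let lastActivity = Date.now();

  // Resolve only once no module has been transformed for QUIET_PERIOD_MS,
  // i.e. the server has settled after the re-optimization.
  const waitForQuietServer = async () => {
    while (Date.now() - lastActivity < QUIET_PERIOD_MS) {
      await new Promise((resolve) => setTimeout(resolve, 100));
    }
  };

  return {
    name: "rwsdk:stale-dep-retry",
    transform() {
      // Any transform activity marks the server as still "busy".
      lastActivity = Date.now();
      return null;
    },
    configureServer(server: ViteDevServer) {
      // Register after Vite's internal middlewares so their errors reach us.
      return () => {
        server.middlewares.use(
          async (err: any, req: any, res: any, next: any) => {
            const isStaleDep =
              err?.code === "ERR_OUTDATED_OPTIMIZED_DEP" ||
              /stale pre-bundle|outdated optimize dep/i.test(
                String(err?.message ?? ""),
              );
            if (!isStaleDep) {
              return next(err);
            }

            await waitForQuietServer();

            // Tell connected clients to reload against the now-stable server.
            server.ws.send({ type: "full-reload" });

            // Redirect instead of transparently retrying, so the client
            // re-sends the request (including any POST body) itself.
            res.statusCode = 307;
            res.setHeader("Location", req.url ?? "/");
            res.end();
          },
        );
      };
    },
  };
}
```

The 307 status is what preserves the method and body when the client re-issues a POSTed server action, which is why it is preferred here over a 302.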

docs/architecture/endToEndTesting.md

Lines changed: 2 additions & 0 deletions
@@ -96,5 +96,7 @@ pnpm smoke-test --path=../starter
To run the playground E2E tests:
```
pnpm test:e2e
+
```
This hybrid architecture provides a fast, reliable, and scalable foundation for the E2E test suite, allowing for both high-performance and high-isolation testing of RedwoodSDK's features.
+

docs/architecture/index.md

Lines changed: 3 additions & 0 deletions
@@ -11,6 +11,9 @@ This collection of documents provides a high-level overview of the core architec
- [**The SSR Bridge**](./ssrBridge.md)
  Details the architecture that allows the framework to support two different rendering environments (RSC and traditional SSR) within a single Cloudflare Worker. It explains how the "SSR Bridge" uses Vite's Environments API to manage conflicting dependency requirements between the two runtimes.

+- [**Dev Server Stability**](./devServerStability.md)
+  Explains the multi-layered system that ensures a stable development experience, detailing how the framework handles race conditions and state desynchronization during Vite's dependency re-optimization process.
+
- [**Directive Scanning and Module Resolution**](./directiveScanningAndResolution.md)
  Details the internal `esbuild`-based scanner used to discover `"use client"` and `"use server"` directives, and the context-aware module resolution it employs to handle conditional exports correctly.
