
Fix server hang after "Running scheduled diagnostics refresh" #3628

Open

Copilot wants to merge 13 commits into main from copilot/fix-server-unresponsiveness

Conversation

Contributor

Copilot AI commented Apr 27, 2026

Two unbounded waits on the critical path triggered by "Running scheduled diagnostics refresh" can deadlock the server when the client is slow to respond, leaving the LSP completely unresponsive until it is manually restarted.

RefreshDiagnostics — fire and forget

  • workspace/diagnostic/refresh always returns null; waiting for it is unnecessary
  • Changed to send the request without registering a response channel or blocking
  • Late client responses are silently dropped by the read loop (no matching pending entry)
  • Short-circuits on context cancellation before sending

updateWatch — per-request 1s timeout on client requests with rollback

  • WatchFiles/UnwatchFiles calls hold session.watchesMu while synchronously waiting on client/registerCapability responses
  • Any client delay blocks all other updateWatch callers behind the mutex, and if requestQueue fills up, the read loop can't deliver responses — creating a self-locking deadlock
  • Each WatchFiles/UnwatchFiles call gets its own 1s context.WithTimeout so earlier calls don't consume the deadline for later ones
  • Watchers are deduped by key before calling Acquire to prevent refcount inflation from duplicate patterns
  • On any WatchFiles failure, ALL newly-acquired watchers in that batch are rolled back (not just the failed one) to keep refcounts clean for retry
  • Failed watcher IDs are tracked as "pending" in the registry so updateWatches retries them on the next snapshot update, even when the watcher identity hasn't changed

watchRegistry — encapsulated watch bookkeeping

  • Extracted a watchRegistry type into internal/project/watch.go that encapsulates the ref-counted map, mutex, and provides Acquire and Release methods
  • Added MarkPending, ClearPending, and IsPending methods to track watcher IDs whose registration failed
  • All methods consistently require the caller to hold mu, making the locking discipline explicit and uniform
  • updateWatch now delegates all bookkeeping to the registry instead of manipulating the map directly, making the function easier to follow
  • updateWatches checks IsPending (with explicit lock/unlock around each check) for all watcher types (config, program files, typings, auto-imports) and retries as fresh adds when the watcher identity is unchanged but the previous registration failed

Tests

  • Added TestUpdateWatchTimeoutAndRollback in internal/project/watchtimeout_test.go using synctest and projecttestutil to verify timeout, rollback, and retry behavior with real session operations
  • Uses DidOpenFile to open a file (triggering watches that time out with a mocked slow client), then DidChangeFile with a single character edit (which does not change any watcher identity), followed by GetLanguageService to flush the pending change and trigger updateWatches
  • Verifies that the pending tracking causes updateWatches to retry the failed registrations and that the retry succeeds, even though the watcher identities are unchanged
  • Asserts that retry watcher IDs match the initial attempt IDs and that at least one retried watcher succeeds

Copilot AI and others added 2 commits April 27, 2026 21:41
- Make RefreshDiagnostics fire-and-forget: send the request but don't
  wait for a response (client always returns null). Any response from
  the client is silently ignored by the read loop.
- Add 1s timeout to updateWatch for WatchFiles/UnwatchFiles client
  requests to prevent holding watchesMu indefinitely if the client
  is slow or unresponsive.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/9d82d127-ea83-455b-bd36-297c9fe864b7

Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Copilot AI changed the title from [WIP] Fix server stops responding after running scheduled diagnostics refresh to Fix server hang after "Running scheduled diagnostics refresh" on Apr 27, 2026
Copilot AI requested a review from andrewbranch April 27, 2026 21:45
@andrewbranch andrewbranch marked this pull request as ready for review April 28, 2026 00:44
Copilot AI review requested due to automatic review settings April 28, 2026 00:44
Contributor

Copilot AI left a comment


Pull request overview

Addresses an LSP server hang triggered by “Running scheduled diagnostics refresh” by removing an unnecessary client-response wait and bounding how long watch registration holds a critical mutex.

Changes:

  • Make RefreshDiagnostics send workspace/diagnostic/refresh as fire-and-forget (no response channel / no wait).
  • Add a 1s timeout for WatchFiles/UnwatchFiles client requests made while holding session.watchesMu.
  • Introduce sendClientRequestFireAndForget helper for non-essential client responses.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
internal/project/session.go Adds a watch request timeout context for client watcher register/unregister calls to avoid indefinite mutex holds.
internal/lsp/server.go Changes diagnostics refresh to fire-and-forget and adds a helper to send client requests without awaiting responses.

@andrewbranch
Member

@copilot fix review comments

…lock, rollback + retry on WatchFiles failure

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/b1cea66d-c309-4c26-97cc-f815ab98e5fe

Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 28, 2026

@copilot fix review comments

Addressed all review comments in e2074f9e:

  • RefreshDiagnostics now checks ctx.Err() before sending
  • Timeout context is created after acquiring watchesMu
  • WatchFiles failures roll back session.watches bookkeeping and schedule a single retry after 1s

…apshot updates re-attempt naturally

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/02f6f16f-047c-455f-981d-b83603fc1df8

Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 28, 2026

@copilot fix review comments

All review comments have been addressed across e2074f9e, ccd56581, and 60619b9d.

@andrewbranch
Member

@copilot write a test for the updateWatch timeout and retry. I think you should be able to mock the client implementation using synctest.

Contributor Author

Copilot AI commented Apr 28, 2026

@copilot write a test for the updateWatch timeout and retry. I think you should be able to mock the client implementation using synctest.

Added in b8769928. Two synctest subtests: one verifies that a slow client triggers timeout + rollback (registry empty after), and another verifies a subsequent call succeeds (simulating the next snapshot update re-attempting registration).

Contributor Author

Copilot AI commented Apr 28, 2026

@copilot write a test for the updateWatch timeout and retry. I think you should be able to mock the client implementation using synctest.

Rewrote the test using real session operations in 9e7b395d. The test uses synctest with projecttestutil.GetSessionInitOptions and mocks WatchFilesFunc to block on the first batch. It opens a file in p1 (WatchFiles times out and rolls back), then closes p1 and opens p2 (WatchFiles succeeds), verifying the rollback left the registry clean for re-registration.

…hanged identity

When a WatchFiles call times out, the watcher ID is now marked as pending
in the watchRegistry. On the next snapshot update, updateWatches checks for
pending watchers and retries them as fresh adds, even when the watcher
identity hasn't changed between snapshots.

Rewrote the test to use a single character DidChangeFile edit (no watcher
identity change) followed by GetLanguageService to flush the change and
verify the retry succeeds.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/85f8d007-9ec1-47fa-bf47-4e9a07aba587

Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

internal/project/session.go:1203

  • pending watcher IDs are only cleared on successful registration in the newWatcher path. If a watcher times out (MarkPending) and is later removed (updateWatch called with newWatcher == nil), its pending entry is never cleared, so watchRegistry.pending can grow over the lifetime of a session. Consider calling ClearPending for oldWatcher.Watchers().WatcherID when removing/unwatching, or otherwise pruning pending entries when the corresponding watcher is no longer present.
	if oldWatcher != nil {
		w := oldWatcher.Watchers()
		watchers := append(w.WorkspaceWatchers, w.OutsideWorkspaceWatchers...)
		if len(watchers) > 0 {
			var removedIDs []WatcherID
			for _, watcher := range watchers {
				if id, removed := session.watches.Release(watcher); removed {
					removedIDs = append(removedIDs, id)
				}
			}
			for _, id := range removedIDs {
				if err := session.client.UnwatchFiles(watchCtx, id); err != nil {
					errors = append(errors, err)
				} else if logger != nil && newWatcher == nil {
					logger.Log(fmt.Sprintf("Removed watch: %s", id))
				}
			}
		}
	}

Comment on lines +1227 to +1232

	// Retry config watchers whose IDs didn't change but whose previous registration failed.
	for path, newEntry := range newSnapshot.ConfigFileRegistry.configs {
		if oldEntry, ok := oldSnapshot.ConfigFileRegistry.configs[path]; ok {
			if oldEntry.rootFilesWatch.ID() == newEntry.rootFilesWatch.ID() && s.watches.IsPending(newEntry.rootFilesWatch.ID()) {
				errors = append(errors, updateWatch(ctx, s, s.logger, nil, newEntry.rootFilesWatch)...)
			}

Copilot AI Apr 29, 2026


The pending retry path calls updateWatch(ctx, s, ..., nil, newWatcher). If the original updateWatch call partially succeeded (some WatchFiles calls returned nil before a later one timed out), the registry will still contain refcounts for the successful watchers. Retrying with oldWatcher==nil will Acquire again and permanently increment those refcounts (no matching Release), which can prevent future UnwatchFiles from ever firing and leak client registrations. To keep refcounts consistent, consider making the failure path roll back the entire parent watcher’s bookkeeping when any error occurs (not just the failed watcher), or implement a retry helper that registers only missing keys without incrementing refcounts for already-tracked entries.

Comment thread internal/project/watch.go
Comment on lines +32 to +63
// watchRegistry tracks the current watch globs and how many individual
// WatchedFiles reference each glob. It guards concurrent access with a mutex
// and provides ref-count helpers so callers don't manipulate the map directly.
// It also tracks which parent watcher IDs have pending (failed) registrations
// so that updateWatches can retry them even when the watcher identity hasn't changed.
type watchRegistry struct {
	mu      sync.Mutex
	entries map[fileSystemWatcherKey]*fileSystemWatcherValue
	pending map[WatcherID]struct{}
}

func newWatchRegistry() *watchRegistry {
	return &watchRegistry{
		entries: make(map[fileSystemWatcherKey]*fileSystemWatcherValue),
		pending: make(map[WatcherID]struct{}),
	}
}

// Acquire increments the ref count for a watcher. If this is the first
// reference (count goes from 0 to 1), it returns true so the caller knows
// to register the watcher with the client.
func (r *watchRegistry) Acquire(watcher *lsproto.FileSystemWatcher, id WatcherID) (isNew bool) {
	key := toFileSystemWatcherKey(watcher)
	value := r.entries[key]
	if value == nil {
		value = &fileSystemWatcherValue{id: id}
		r.entries[key] = value
	}
	value.count++
	return value.count == 1
}


Copilot AI Apr 29, 2026


watchRegistry is documented as guarding concurrent access with its mutex, but Acquire/Release/MarkPending/ClearPending do not lock r.mu (only IsPending does). Since session.go now locks session.watches.mu directly, the locking discipline is split across call sites, which makes it easy to accidentally call these methods without holding the mutex (data race) and undermines the stated encapsulation. Consider either (1) moving the locking inside watchRegistry methods (and removing external r.mu access), or (2) updating the type/method docs and API to make the required locking explicit (e.g., AcquireLocked/ReleaseLocked or a WithLock helper) so future callers can’t misuse it.

Comment on lines +1139 to 1160

 	session.watches.mu.Lock()
 	defer session.watches.mu.Unlock()
 	// Use a timeout for client requests to prevent holding the mutex indefinitely
 	// if the client is slow or unresponsive.
 	watchCtx, watchCancel := context.WithTimeout(ctx, watchRequestTimeout)
 	defer watchCancel()
 	if newWatcher != nil {
 		w := newWatcher.Watchers()
 		watchers := append(w.WorkspaceWatchers, w.OutsideWorkspaceWatchers...)
 		if len(watchers) > 0 {
 			var newWatchers collections.OrderedMap[WatcherID, *lsproto.FileSystemWatcher]
 			for i, watcher := range watchers {
 				key := toFileSystemWatcherKey(watcher)
-				value := session.watches[key]
 				globId := WatcherID(fmt.Sprintf("%s.%d", w.WatcherID, i))
-				if value == nil {
-					value = &fileSystemWatcherValue{id: globId}
-					session.watches[key] = value
-				}
-				value.count++
-				if value.count == 1 {
+				if session.watches.Acquire(watcher, globId) {
 					newWatchers.Set(globId, watcher)
 				}
 			}
 			for id, watcher := range newWatchers.Entries() {
-				if err := session.client.WatchFiles(ctx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+				if err := session.client.WatchFiles(watchCtx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
 					session.watches.Release(watcher)
 					errors = append(errors, err)
 				} else if logger != nil {

Copilot AI Apr 29, 2026


updateWatch creates a single watchCtx with a 1s deadline and reuses it for all WatchFiles/UnwatchFiles calls in this update. If multiple registrations/unregistrations are needed, later calls may inherit very little time remaining (e.g., first call takes 900ms → second gets ~100ms), causing avoidable timeouts even when the client is responsive. Consider creating a fresh per-request timeout context inside the WatchFiles/UnwatchFiles loops (or otherwise resetting the deadline per call) while still ensuring the mutex is not held indefinitely.

Comment on lines 1150 to 1160

 			for i, watcher := range watchers {
 				key := toFileSystemWatcherKey(watcher)
-				value := session.watches[key]
 				globId := WatcherID(fmt.Sprintf("%s.%d", w.WatcherID, i))
-				if value == nil {
-					value = &fileSystemWatcherValue{id: globId}
-					session.watches[key] = value
-				}
-				value.count++
-				if value.count == 1 {
+				if session.watches.Acquire(watcher, globId) {
 					newWatchers.Set(globId, watcher)
 				}
 			}
 			for id, watcher := range newWatchers.Entries() {
-				if err := session.client.WatchFiles(ctx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
+				if err := session.client.WatchFiles(watchCtx, id, []*lsproto.FileSystemWatcher{watcher}); err != nil {
 					session.watches.Release(watcher)
 					errors = append(errors, err)
 				} else if logger != nil {

Copilot AI Apr 29, 2026


On WatchFiles error, updateWatch rolls back bookkeeping via session.watches.Release(watcher) exactly once for that watcher. If the input watcher list contains duplicate patterns (same fileSystemWatcherKey), Acquire will increment the refcount multiple times but only one client registration is attempted; a failed registration would then leave the refcount >0, preventing future retries from ever treating that watch as “new”. This is reachable because some glob lists are only sorted (not deduped) before becoming watchers. Consider deduping watcher keys before calling Acquire, or tracking how many increments were applied per key and fully rolling them back on registration failure.

Comment thread internal/project/watchtimeout_test.go Outdated
Comment on lines +121 to +122
	assert.Assert(t, len(successfulIDs) >= 1,
		"expected at least one watcher to be registered after retry, got %d", len(successfulIDs))

Copilot AI Apr 29, 2026


The assertion block says it verifies WatchFiles was retried “with the same watcher IDs”, but the test only checks that there were additional calls and that at least one succeeded. To make the test match its intent (and guard against accidental ID churn), consider asserting that the retry IDs set matches the initial attempted IDs set (or at least that each initial ID is retried).

Suggested change

-	assert.Assert(t, len(successfulIDs) >= 1,
-		"expected at least one watcher to be registered after retry, got %d", len(successfulIDs))
+	firstAttemptSet := make(map[any]struct{}, len(firstAttemptIDs))
+	for _, id := range firstAttemptIDs {
+		firstAttemptSet[id] = struct{}{}
+	}
+	retrySet := make(map[any]struct{}, len(retryIDs))
+	for _, id := range retryIDs {
+		retrySet[id] = struct{}{}
+	}
+	assert.Equal(t, len(retrySet), len(firstAttemptSet),
+		"expected retry to use the same watcher IDs as the first attempt, got first attempt IDs=%v retry IDs=%v",
+		firstAttemptIDs, retryIDs)
+	for id := range firstAttemptSet {
+		_, ok := retrySet[id]
+		assert.Assert(t, ok,
+			"expected watcher ID %v from the first attempt to be retried; first attempt IDs=%v retry IDs=%v",
+			id, firstAttemptIDs, retryIDs)
+	}
+	successfulRetriedWatchers := 0
+	for _, id := range successfulIDs {
+		if _, ok := retrySet[id]; ok {
+			successfulRetriedWatchers++
+		}
+	}
+	assert.Assert(t, successfulRetriedWatchers >= 1,
+		"expected at least one retried watcher to be registered after retry, got %d successful retried watchers (successful IDs=%v, retry IDs=%v)",
+		successfulRetriedWatchers, successfulIDs, retryIDs)

@andrewbranch
Member

@copilot address review comments

Copilot AI and others added 2 commits April 29, 2026 16:22
…up watchers, rollback all on failure, stronger test assertions

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/81267cbb-20f5-41d0-b92b-9f1b30604e5d

Co-authored-by: andrewbranch <3277153+andrewbranch@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 29, 2026

@copilot address review comments

Addressed all five review comments in 22abe61:

  1. Partial success refcount leak: On any WatchFiles error, ALL newly-acquired watchers are now rolled back (not just the failed one), keeping refcounts clean for retry.
  2. Consistent locking: All watchRegistry methods now consistently require the caller to hold mu. Removed the internal lock from IsPending; callers in updateWatches now lock/unlock around each check.
  3. Per-request timeout: Each WatchFiles/UnwatchFiles call gets its own context.WithTimeout so earlier calls don't consume the deadline for later ones.
  4. Duplicate patterns: Watchers are deduped by key before calling Acquire to prevent refcount inflation.
  5. Stronger test assertions: Test now verifies retry watcher IDs are a subset of the first attempt IDs and that at least one retried watcher succeeded.



Development

Successfully merging this pull request may close these issues.

Server stops responding after Running scheduled diagnostics refresh
