fix(amazonq): properly handle encoder server exception #5585


Closed
wants to merge 2 commits

Conversation

@leigaol commented Apr 16, 2025

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Description

The encoder server is a single-threaded server that may become unresponsive during heavy computation. At the beginning of indexing, the encoder server first parses all js, ts, py, and java files using tree-sitter, then proceeds to build the vector index.

I have observed that if a repo has >25k large java files, for example https://github.com/elastic/elasticsearch (1.4 GB), tree-sitter parsing can take 6 min 20 s. I tried opening and indexing the elasticsearch repo 20 times; in about 10 of those runs I saw java.net.ConnectException: Connection refused. This indicates that the client failed to complete a TCP handshake with the server: because the server is busy parsing files, the Node.js event loop is not free to handle the TCP handshake but is fully occupied with parsing. This also explains why we sometimes see the project context java.net.SocketTimeoutException: Read timed out message.
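
For illustration, here is a minimal Kotlin sketch of the kind of request the stack trace below originates from. Per the trace, ProjectContextProvider talks to the local encoder server over HttpURLConnection; the port, path, and timeout values here are assumptions, not the plugin's actual configuration:

    import java.net.HttpURLConnection
    import java.net.URL

    // Minimal sketch; the real port and path come from the encoder server setup.
    fun sendIndexRequest(payload: String): Int {
        val conn = URL("http://127.0.0.1:3000/indexFiles").openConnection() as HttpURLConnection
        conn.requestMethod = "POST"
        conn.doOutput = true
        conn.connectTimeout = 5_000   // ConnectException: handshake never accepted
        conn.readTimeout = 30_000     // SocketTimeoutException: event loop stays busy
        conn.outputStream.use { it.write(payload.toByteArray()) }
        return conn.responseCode      // the getResponseCode() frame in the trace below
    }

While the server's single Node.js thread is stuck in parsing, the connect attempt is either refused outright (ConnectException) or an accepted connection is never serviced (SocketTimeoutException).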

In the test case (25k large java files), the long-running index HTTP call has roughly a 50% chance of failing with connection refused; when the connection is refused, the LSP process also quits. The chance of running into this issue is significantly smaller for small repos, because parsing finishes very quickly, and for repos that are not js/ts/py/java, because we do not parse other languages (that is why this bug is not reproducible in the aws-toolkit-jetbrains Kotlin repo).

When such a connection refused error occurred, we retried indexing, which interrupts the in-flight indexing and performs repetitive indexing; combined with the log looping issue fixed in 3357e88, this contributed to the IDE performance issue.
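
A minimal sketch of the revised handling, assuming a surrounding sendMsgToLsp-style function and reusing the hypothetical sendIndexRequest above; the encoderServer and logger collaborators mirror the diff quoted later in this thread, but the enclosing function is assumed:

    import java.net.ConnectException

    // Sketch only: encoderServer, logger, and this function are assumptions.
    fun sendMsgToLspSafely(payload: String) {
        try {
            sendIndexRequest(payload)
        } catch (e: ConnectException) {
            if (encoderServer.isNodeProcessRunning()) {
                // The server process is alive but its event loop is busy parsing:
                // treat the refusal as transient instead of retrying, since a
                // retry would break the in-flight indexing.
                return
            } else {
                // The process actually died; retrying cannot help.
                logger.warn(e) { "project context process quit unexpectedly" }
            }
        }
    }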

The vector indexing process already has a "break out of the current event loop" design that pauses indexing and watches OS CPU/memory usage; I have yet to see a java.net.ConnectException: Connection refused error during vector indexing.
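
That pause-and-check design lives in the Node.js encoder server and is not part of this diff; as a rough Kotlin analogue of the pattern (the chunking, delays, and 50% CPU threshold are all illustrative assumptions):

    import com.sun.management.OperatingSystemMXBean
    import java.lang.management.ManagementFactory
    import kotlinx.coroutines.delay

    fun indexChunk(chunk: List<String>) { /* hypothetical unit of indexing work */ }

    suspend fun buildIndexInChunks(chunks: List<List<String>>) {
        val os = ManagementFactory.getPlatformMXBean(OperatingSystemMXBean::class.java)
        for (chunk in chunks) {
            indexChunk(chunk)
            delay(10)                 // yield so other events (e.g. HTTP) can be served
            while (os.cpuLoad > 0.5) {
                delay(1_000)          // back off while the OS is under load
            }
        }
    }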

WARN - software.aws.toolkits.jetbrains.services.amazonq.project.ProjectContextProvider - failed to init project context

java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:592)
    at java.base/java.net.Socket.connect(Socket.java:751)
    at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:178)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:531)
    at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:636)
    at java.base/sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:948)
    at java.base/sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:759)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1705)
    at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1614)
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:531)
    at software.aws.toolkits.jetbrains.services.amazonq.project.ProjectContextProvider$sendMsgToLsp$4.invokeSuspend(ProjectContextProvider.kt:341)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
    at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:111)
    at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:99)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:608)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:873)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:763)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:750)

Checklist

  • My code follows the code style of this project
  • I have added tests to cover my changes
  • A short description of the change has been added to the CHANGELOG if the change is customer-facing in the IDE.
  • I have added metrics for my changes (if required)

License

I confirm that my contribution is made under the terms of the Apache 2.0 license.

@leigaol requested a review from a team as a code owner April 16, 2025 04:52
@leigaol requested a review from Copilot April 16, 2025 05:11
Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

plugins/amazonq/shared/jetbrains-community/src/software/aws/toolkits/jetbrains/services/amazonq/project/ProjectContextProvider.kt:127

  • [nitpick] Consider logging a debug-level message before returning in this branch to clearly indicate that a 'Connection refused' error is expected to be temporarily ignored because the encoder server is busy with Tree-sitter parsing.
if (encoderServer.isNodeProcessRunning()) {

@leigaol marked this pull request as draft April 16, 2025 05:19
@leigaol changed the title from "fix(amazonq): properly handle connection refused from encoder server" to "fix(amazonq): properly handle encoder server exception" Apr 16, 2025
@leigaol closed this Apr 16, 2025
if (encoderServer.isNodeProcessRunning()) {
    // Server is alive but busy (e.g. tree-sitter parsing): skip the retry.
    return
} else {
    logger.warn(e) { "project context process quit unexpectedly" }
}
Contributor Author

When the process has died, restarting this function does not help.

Contributor Author

The "process quit unexpectedly" issue will be fixed in newer WS LSP releases, so this PR is no longer needed.
