Skip to content

Fix QueueResizableOpenSearchThreadPoolExecutorTests #18006

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

andrross
Copy link
Member

@andrross andrross commented Apr 18, 2025

There was a race condition in testResizeQueueDown() where depending on
random parameters we could submit up to 1002 tasks into an executor with
a queue size of 900. That introduced a race condition where if the tasks
didn't execute fast enough then a rejected execution exception could
happen and fail the test. The fix is to resize down to a queue size of
1500 to ensure there is enough capacity even if all tasks are submitted
before any can be executed.

And finally I refactored the tests to reduce duplication of code and
ensure the executor gets shutdown properly even in case of a test
failure. This will avoid the spurious thread leak failure if a test case
exits because of a failure.

Related Issues

Resolves #14297

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

❌ Gradle check result for 7e471e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for cc0563c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@reta
Copy link
Collaborator

reta commented Apr 18, 2025

I guess I don't know what the intent of testResizeQueueSameSize() is because the class does not allow you to call resize and pass the same size.

@andrross I think the name of the method may be confusing; just looking into the test, what it does is that

  1. create a pool with a queue of capacity 2000
  2. resize the queue capacity to 1000
  3. submit bunch of tasks and checks that queue size it at most 1000 (the same) or less

Is it helpful?

@andrross
Copy link
Member Author

I guess I don't know what the intent of testResizeQueueSameSize() is because the class does not allow you to call resize and pass the same size.

@andrross I think the name of the method may be confusing; just looking into the test, what it does is that

  1. create a pool with a queue of capacity 2000
  2. resize the queue capacity to 1000
  3. submit bunch of tasks and checks that queue size it at most 1000 (the same) or less

Is it helpful?

Got it. I think we can keep the test as it is.

@andrross andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from cc0563c to 2a562b9 Compare April 21, 2025 15:11
Copy link
Contributor

❌ Gradle check result for 2a562b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 2a562b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from 2a562b9 to 7279456 Compare April 21, 2025 21:55
Copy link
Contributor

❌ Gradle check result for 7279456: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 7279456: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 7279456: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for a6990d9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for a6990d9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

✅ Gradle check result for a6990d9: SUCCESS

Copy link

codecov bot commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.59%. Comparing base (6afec2a) to head (a6990d9).
Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18006      +/-   ##
============================================
+ Coverage     72.57%   72.59%   +0.02%     
- Complexity    67160    67170      +10     
============================================
  Files          5478     5478              
  Lines        310130   310132       +2     
  Branches      45087    45087              
============================================
+ Hits         225068   225147      +79     
+ Misses        66661    66573      -88     
- Partials      18401    18412      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test.

I am unable to see on what fixed this flakiness in the test testResizeQueueDown?

ThreadContext context = new ThreadContext(Settings.EMPTY);
this.queue = new ResizableBlockingQueue<>(
ConcurrentCollections.newBlockingQueue(),
Objects.requireNonNull(queueSize, "All tests must set a queue size")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Objects.requireNonNull() is designed for reference types only. For primitive data types, this check would always pass. Did you intend to do something else? may be value check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! This was leftover from a previous version of the code where it actually made sense.

}

/** Use a runnable wrapper that simulates a task with unknown failures. */
public void testExceptionThrowingTask() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for testExceptionThrowingTask, the runnable wrapper was earlier exceptionalWrapper() which is changed to fastWrapper. What's the reason for this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another good catch, this got mixed up in the refactoring. I'll fix it.

@andrross
Copy link
Member Author

There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test.

I am unable to see on what fixed this flakiness in the test testResizeQueueDown?

Instead of resizing down to 900, it resizes down to 1500 which guarantees that the executor has enough capacity to not reject anything if all tasks are submitted before any are able to be executed.

@andrross andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from a6990d9 to 210949d Compare April 24, 2025 16:35
Copy link
Contributor

❌ Gradle check result for 210949d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

There was a race condition in testResizeQueueDown() where depending on
random parameters we could submit up to 1002 tasks into an executor with
a queue size of 900. That introduced a race condition where if the tasks
didn't execute fast enough then a rejected execution exception could
happen and fail the test. The fix is to resize down to a queue size of
1500 to ensure there is enough capacity even if all tasks are submitted
before any can be executed.

And finally I refactored the tests to reduce duplication of code and
ensure the executor gets shutdown properly even in case of a test
failure. This will avoid the spurious thread leak failure if a test case
exits because of a failure.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from 210949d to 7ba1155 Compare April 24, 2025 20:29
Copy link
Contributor

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autocut Cluster Manager flaky-test Random test failure that succeeds on second run Other skip-changelog >test-failure Test failure from CI, local build, etc.
Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for QueueResizableOpenSearchThreadPoolExecutorTests
3 participants