Fix QueueResizableOpenSearchThreadPoolExecutorTests #18006

andrross · 2025-04-18T17:56:13Z

testResizeQueueDown() had a race condition: depending on random
parameters, it could submit up to 1002 tasks into an executor with a
queue size of 900. If the tasks didn't execute fast enough, a rejected
execution exception could happen and fail the test. The fix is to resize
down to a queue size of 1500 so that there is enough capacity even if
all tasks are submitted before any can be executed.

I also refactored the tests to reduce code duplication and to ensure the
executor is shut down properly even when a test fails. This avoids the
spurious thread leak failure when a test case exits because of a
failure.
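
As a rough illustration of the shutdown guarantee described above, here is a minimal sketch using only java.util.concurrent. It is not the code from this PR: the class name, the runWithExecutor helper, the lastExecutor field, and the 10-second timeout are all assumptions made for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class ShutdownOnFailureSketch {
    // Hypothetical field, kept only so the demo can inspect the executor afterwards.
    static ExecutorService lastExecutor;

    /** Hypothetical helper: runs a test body and always shuts the executor down. */
    static void runWithExecutor(Consumer<ExecutorService> testBody) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        lastExecutor = executor;
        try {
            testBody.accept(executor);
        } finally {
            // Runs even if the test body throws, so no worker threads are leaked.
            executor.shutdown();
            executor.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            runWithExecutor(executor -> {
                executor.execute(() -> {});
                throw new AssertionError("simulated test failure");
            });
        } catch (AssertionError expected) {
            // The failure still propagates to the caller; only the cleanup is guaranteed.
        }
        System.out.println("terminated=" + lastExecutor.isTerminated());
    }
}
```

Putting the shutdown in a finally block (or in a test teardown method) means a failed assertion no longer leaves worker threads behind for a thread leak check to report as a second, misleading failure.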

Related Issues

Resolves #14297

Check List

Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2025-04-18T18:12:26Z

❌ Gradle check result for 7e471e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-18T19:16:42Z

❌ Gradle check result for cc0563c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

reta · 2025-04-18T22:45:42Z

I guess I don't know what the intent of testResizeQueueSameSize() is because the class does not allow you to call resize and pass the same size.

@andrross I think the name of the method may be confusing. Looking at the test, what it does is:

create a pool with a queue of capacity 2000
resize the queue capacity to 1000
submit a bunch of tasks and check that the queue size is at most 1000 (the same) or less

Is that helpful?

andrross · 2025-04-21T15:09:41Z

I guess I don't know what the intent of testResizeQueueSameSize() is because the class does not allow you to call resize and pass the same size.

@andrross I think the name of the method may be confusing. Looking at the test, what it does is:

create a pool with a queue of capacity 2000
resize the queue capacity to 1000
submit a bunch of tasks and check that the queue size is at most 1000 (the same) or less

Is that helpful?

Got it. I think we can keep the test as it is.

github-actions · 2025-04-21T15:38:17Z

❌ Gradle check result for 2a562b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-21T21:27:36Z

❌ Gradle check result for 2a562b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-21T22:56:28Z

❌ Gradle check result for 7279456: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-21T23:59:30Z

❌ Gradle check result for 7279456: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-22T06:11:56Z

❌ Gradle check result for 7279456: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-23T00:06:49Z

❌ Gradle check result for a6990d9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-23T01:20:05Z

❌ Gradle check result for a6990d9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-23T03:55:05Z

✅ Gradle check result for a6990d9: SUCCESS

codecov · 2025-04-23T03:55:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.59%. Comparing base (6afec2a) to head (a6990d9).
Report is 6 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #18006      +/-   ##
============================================
+ Coverage     72.57%   72.59%   +0.02%     
- Complexity    67160    67170      +10     
============================================
  Files          5478     5478              
  Lines        310130   310132       +2     
  Branches      45087    45087              
============================================
+ Hits         225068   225147      +79     
+ Misses        66661    66573      -88     
- Partials      18401    18412      +11

ashking94

There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test.

I am unable to see what fixed the flakiness in testResizeQueueDown. Could you point it out?

ashking94 · 2025-04-24T06:19:14Z

...a/org/opensearch/common/util/concurrent/QueueResizableOpenSearchThreadPoolExecutorTests.java

+        ThreadContext context = new ThreadContext(Settings.EMPTY);
+        this.queue = new ResizableBlockingQueue<>(
+            ConcurrentCollections.newBlockingQueue(),
+            Objects.requireNonNull(queueSize, "All tests must set a queue size")


Objects.requireNonNull() is designed for reference types only. For primitive data types, this check would always pass. Did you intend to do something else? Maybe a value check?

andrross · Apr 24, 2025

Good catch! This was left over from a previous version of the code where it actually made sense.
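
To make the review point concrete, here is a small standalone sketch. It assumes, as the comment implies, that queueSize is a primitive int. The class name, both method names, and the value check itself are hypothetical and are not taken from the PR.

```java
import java.util.Objects;

final class QueueSizeCheckSketch {
    /**
     * Compiles, but the int argument is autoboxed to a non-null Integer,
     * so requireNonNull can never throw here.
     */
    static int nullCheckOnPrimitive(int queueSize) {
        return Objects.requireNonNull(queueSize, "All tests must set a queue size");
    }

    /** Hypothetical alternative: validate the value instead of the reference. */
    static int valueCheck(int queueSize) {
        if (queueSize <= 0) {
            throw new IllegalArgumentException("All tests must set a positive queue size");
        }
        return queueSize;
    }

    public static void main(String[] args) {
        // An unset int field defaults to 0 and passes the null check unchanged.
        System.out.println(nullCheckOnPrimitive(0));
        try {
            valueCheck(0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If "unset" needs to be distinguishable from a real value, another option would be to declare the field as a nullable Integer, in which case the original requireNonNull call becomes meaningful again.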

ashking94 · 2025-04-24T06:25:08Z

...a/org/opensearch/common/util/concurrent/QueueResizableOpenSearchThreadPoolExecutorTests.java

    }

    /** Use a runnable wrapper that simulates a task with unknown failures. */
-    public void testExceptionThrowingTask() throws Exception {


For testExceptionThrowingTask, the runnable wrapper was previously exceptionalWrapper() and is now fastWrapper. What's the reason for this change?

andrross · Apr 24, 2025

Another good catch, this got mixed up in the refactoring. I'll fix it.

andrross · 2025-04-24T15:06:12Z

There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test.

I am unable to see what fixed the flakiness in testResizeQueueDown. Could you point it out?

Instead of resizing down to 900, it resizes down to 1500 which guarantees that the executor has enough capacity to not reject anything if all tasks are submitted before any are able to be executed.
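
The capacity argument can be reproduced with a plain ThreadPoolExecutor. The sketch below is an illustration under stated assumptions, not the actual test: it uses a fixed-capacity ArrayBlockingQueue and a single worker thread in place of the resizable queue and executor under test, and it blocks the worker on a latch to model the worst case where no task finishes before all are submitted. Only the numbers 1002, 900, and 1500 come from this PR; the class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class QueueCapacitySketch {
    /** Submits taskCount tasks while the only worker is blocked; returns the number rejected. */
    static int countRejections(int queueCapacity, int taskCount) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queueCapacity));
        int rejected = 0;
        try {
            for (int i = 0; i < taskCount; i++) {
                try {
                    // Every task waits on the latch, so nothing completes during submission.
                    executor.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // Default AbortPolicy: thrown when the pool is saturated and the queue is full.
                    rejected++;
                }
            }
        } finally {
            release.countDown();
            executor.shutdown();
            executor.awaitTermination(10, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("capacity 900:  " + countRejections(900, 1002) + " rejected");
        System.out.println("capacity 1500: " + countRejections(1500, 1002) + " rejected");
    }
}
```

In this model one task occupies the worker and 900 fit in the queue, so the remaining 101 of the 1002 submissions are rejected. With a capacity of 1500, every task fits regardless of how slowly the tasks run, which is why the larger size removes the timing dependence.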

github-actions · 2025-04-24T16:50:40Z

❌ Gradle check result for 210949d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

There was a race condition in testResizeQueueDown() where depending on random parameters we could submit up to 1002 tasks into an executor with a queue size of 900. That introduced a race condition where if the tasks didn't execute fast enough then a rejected execution exception could happen and fail the test. The fix is to resize down to a queue size of 1500 to ensure there is enough capacity even if all tasks are submitted before any can be executed.

And finally I refactored the tests to reduce duplication of code and ensure the executor gets shutdown properly even in case of a test failure. This will avoid the spurious thread leak failure if a test case exits because of a failure.

Signed-off-by: Andrew Ross <andrross@amazon.com>

github-actions · 2025-04-24T21:33:37Z

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-24T23:14:26Z

❌ Gradle check result for 7ba1155: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

andrross added the skip-changelog label Apr 18, 2025

github-actions bot added the >test-failure, autocut, Cluster Manager, flaky-test, and Other labels Apr 18, 2025

github-project-automation bot added this to Cluster Manager Project Board Apr 18, 2025

andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from cc0563c to 2a562b9 Compare April 21, 2025 15:11

opensearch-ci-bot mentioned this pull request Apr 21, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SecureReactorNetty4HttpServerTransportTests #17486

Open

andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from 2a562b9 to 7279456 Compare April 21, 2025 21:55

This was referenced Apr 21, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SimpleSearchIT #16851

Closed

[AUTOCUT] Gradle Check Flaky Test Report for SegmentReplicationResizeRequestIT #17552

Open

opensearch-ci-bot mentioned this pull request Apr 22, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SearchRestCancellationIT #14311

Open

andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from 7279456 to a6990d9 Compare April 22, 2025 23:05

opensearch-ci-bot mentioned this pull request Apr 23, 2025

[AUTOCUT] Gradle Check Flaky Test Report for MinimumClusterManagerNodesIT #14289

Open

ashking94 requested changes Apr 24, 2025

github-project-automation bot moved this to 👀 In review in Cluster Manager Project Board Apr 24, 2025

andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from a6990d9 to 210949d Compare April 24, 2025 16:35

andrross force-pushed the fix-QueueResizableOpenSearchThreadPoolExecutorTests branch from 210949d to 7ba1155 Compare April 24, 2025 20:29

opensearch-ci-bot mentioned this pull request Apr 25, 2025

[AUTOCUT] Gradle Check Flaky Test Report for S3BlobContainerRetriesTests #17551

Open
