
Fix flaky tests in GoogleCloudStorageBlobStoreRepositoryTests, S3BlobStoreRepositoryTests, AzureBlobStoreRepositoryTests #18290


Merged
merged 1 commit into from
May 14, 2025

Conversation

kkewwei
Contributor

@kkewwei kkewwei commented May 14, 2025

Description

I ran S3BlobStoreRepositoryTests.testSnapshotAndRestore with tests.seed=9D496123288AF73F and found that every request is retried 3 times, sleeping between attempts according to the backoff policy. This takes too much time and makes the tests flaky.

[2025-05-14T16:15:21,549][DEBUG][s.a.a.request            ] [node_t0] Retryable error detected. Will retry in 52ms. Request attempt number 1
......
[2025-05-14T16:15:21,647][DEBUG][s.a.a.request            ] [node_t0] Retryable error detected. Will retry in 145ms. Request attempt number 2

The affected requests look like this:
GET /bucket?list-type=2&delimiter=%2F&prefix=index-
PUT /bucket/r10011100011010/indices/Hfd8LcIaQmu_nAtfzC7YJg/5/__Qf9SKHinSJWqsxDYO2Qrpw
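
To illustrate why forced retries add up, here is a rough sketch of the sleep time accumulated per request. The base delay, the doubling rule, and the request count are all hypothetical; the real SDK policy adds jitter, as the 52ms and 145ms delays in the log above suggest. The sketch assumes the first three attempts of every request fail.

```java
public class RetryCostSketch {
    // Hypothetical values for illustration; not the object store client's actual retry policy.
    static final long BASE_DELAY_MS = 50;
    static final int FORCED_FAILURES = 3;

    /** Delay before retry number `attempt` (1-based), doubling each time, no jitter. */
    static long backoffMillis(int attempt) {
        return BASE_DELAY_MS << (attempt - 1);
    }

    /** Total time one request spends sleeping when its first `failures` attempts fail. */
    static long sleepPerRequestMillis(int failures) {
        long total = 0;
        for (int attempt = 1; attempt <= failures; attempt++) {
            total += backoffMillis(attempt);
        }
        return total;
    }

    public static void main(String[] args) {
        long perRequest = sleepPerRequestMillis(FORCED_FAILURES); // 50 + 100 + 200 = 350 ms
        long requests = 10_000; // hypothetical number of blob store requests in one test run
        System.out.println("Sleep per request: " + perRequest + " ms");
        System.out.println("Sleep if issued sequentially: " + (perRequest * requests) / 1000 + " s");
    }
}
```

Even with small delays, the total grows linearly with the number of requests, which is consistent with the slowdown described above.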

Related Issues

Resolves #14291 #14299 #11493

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kkewwei kkewwei requested a review from a team as a code owner May 14, 2025 13:22
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run Storage:Snapshots labels May 14, 2025
…StoreRepositoryTests, AzureBlobStoreRepositoryTests

Signed-off-by: kkewwei <kewei.11@bytedance.com>
Signed-off-by: kkewwei <kkewwei@163.com>
@kkewwei
Contributor Author

kkewwei commented May 14, 2025

@andrross @reta You may be interested. Please have a look in your spare time.

Contributor

❌ Gradle check result for 8c42ae9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Member

This is great @kkewwei, thank you!

Contributor

✅ Gradle check result for 8c42ae9: SUCCESS


codecov bot commented May 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.60%. Comparing base (998ae73) to head (8c42ae9).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18290      +/-   ##
============================================
+ Coverage     72.48%   72.60%   +0.11%     
- Complexity    67357    67432      +75     
============================================
  Files          5488     5488              
  Lines        311023   311023              
  Branches      45217    45217              
============================================
+ Hits         225444   225809     +365     
+ Misses        67282    66829     -453     
- Partials      18297    18385      +88     


@andrross
Member

A couple thoughts here for posterity:

  • I hardcoded a 25% failure rate and S3BlobStoreRepositoryTests passed in about 30 seconds. Previously, with the "always retry 3 times" setting, the test would run for 20+ minutes
  • 25% is still a very high failure rate and should adequately test that the repositories can handle transient failures
  • I suspect the default retry policies of the object store clients have changed over time as those clients are upgraded and that this test wasn't always problematic
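
For readers unfamiliar with the idea, a probabilistic failure injector can be sketched as below. This is not the code merged in this PR; the class name `RandomFailureInjector`, its methods, and the seed are hypothetical. It only shows the general technique of failing a fixed fraction of requests rather than failing every request a fixed number of times.

```java
import java.util.Random;

public class RandomFailureInjector {
    private final Random random;
    private final double failureRate;

    /** seed makes the failure sequence reproducible; failureRate must be in [0, 1). */
    public RandomFailureInjector(long seed, double failureRate) {
        if (failureRate < 0.0 || failureRate >= 1.0) {
            throw new IllegalArgumentException("failureRate must be in [0, 1)");
        }
        this.random = new Random(seed);
        this.failureRate = failureRate;
    }

    /** Returns true when the current request should be answered with a simulated error. */
    public boolean shouldFail() {
        return random.nextDouble() < failureRate;
    }

    /** Number of attempts a client needs before one request gets through. */
    public int attemptsUntilSuccess() {
        int attempts = 1;
        while (shouldFail()) {
            attempts++;
        }
        return attempts;
    }

    public static void main(String[] args) {
        RandomFailureInjector injector = new RandomFailureInjector(42L, 0.25);
        int requests = 100_000;
        long totalAttempts = 0;
        for (int i = 0; i < requests; i++) {
            totalAttempts += injector.attemptsUntilSuccess();
        }
        // With a failure rate p, the expected number of attempts per request is 1 / (1 - p).
        System.out.println("Average attempts per request: " + (double) totalAttempts / requests);
    }
}
```

With a 25% rate, most requests succeed on the first attempt and only a few pay the backoff cost more than once, while retry handling is still exercised regularly.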

@andrross andrross merged commit bde7db5 into opensearch-project:main May 14, 2025
43 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Storage Project Board May 14, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 14, 2025
…StoreRepositoryTests, AzureBlobStoreRepositoryTests (#18290)

Signed-off-by: kkewwei <kewei.11@bytedance.com>
Signed-off-by: kkewwei <kkewwei@163.com>
(cherry picked from commit bde7db5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request May 14, 2025
…StoreRepositoryTests, AzureBlobStoreRepositoryTests (#18290) (#18298)

(cherry picked from commit bde7db5)

Signed-off-by: kkewwei <kewei.11@bytedance.com>
Signed-off-by: kkewwei <kkewwei@163.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@kkewwei kkewwei deleted the fix_14299 branch May 14, 2025 23:07
Labels
autocut backport 2.19 flaky-test Random test failure that succeeds on second run skip-changelog Storage:Snapshots >test-failure Test failure from CI, local build, etc.
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for AzureBlobStoreRepositoryTests
2 participants