
[AUTOCUT] Gradle Check Flaky Test Report for DedicatedClusterSnapshotRestoreIT #15806


Open
opensearch-ci-bot opened this issue Sep 6, 2024 · 10 comments · Fixed by #17996
Labels
autocut
disabled-test (Issues that are used by an AwaitsFix annotation to temporarily disable a broken test)
flaky-test (Random test failure that succeeds on second run)
Storage:Snapshots
>test-failure (Test failure from CI, local build, etc.)

Comments

@opensearch-ci-bot
Collaborator

opensearch-ci-bot commented Sep 6, 2024

Flaky Test Report for DedicatedClusterSnapshotRestoreIT

Noticed the DedicatedClusterSnapshotRestoreIT has some flaky, failing tests that failed during post-merge actions.

Details

Git Reference | Merged Pull Request | Build Details | Test Name
007600e 17589 54747 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
087e473 17857 55995 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
0d86ac1 17814 55896 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
137683e 17765 55439 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
14d740f 17620 54711 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
155f892 17824 56016 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
2fd3882 17760 55703 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
3638c13 17968 56558 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
4560206 17273 55796 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
51a217a 17865 56008 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
7372360 17973 56639 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
7eeb323 17926 56348 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
8dbf800 17919 56555 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
9bbdd3c 17868 56128 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
9db5e67 17878 56051 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
9de21d1 17021 53454 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
9e75d42 16493 50152 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
a586a62 17627 55339 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
b7dca4a 17830 56012 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
f211d34 17954 56640 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
d510b12 16107 48559 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
175cbd0 15658 46743 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
4648c3f 17342 53847 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
f2ecd3e 15755 47362 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
ffa46ca 17547 54325 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
05b1cf5 17497 54522 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
0714a1b 17354 53866 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
07cb4c9 17446 56222 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
0819161 17821 55790 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
09af518 17396 54146 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
0f53bf9 15596 46460 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1074c71 15890 47737 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1166998 17565 54640 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1275017 17578 54540 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
15d27a1 17746 55301 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1628152 17769 56018 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1acba95 17457 54863 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1b56084 17670 55875 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1b7c055 17395 53658 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
1c86dd1 17573 54660 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
21f69ca 17444 54115 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
32e3eff 17793 55608 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
33be5a9 15557 46424 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
34b8888 16013 48230 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
396add1 17831 55974 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
399188f 17703 55117 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
3c6019d 15558 47413 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
45ec72e 17360 53466 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
468f120 15724 47162 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
4b8f9c8 15799 47503 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
4c98c7e 15559 46090 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
58794ad 15651 47726 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
58eb44e 17513 55989 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
58fc68e 15864 47611 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
5938bc8 17772 56244 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
5bbb699 17905 56260 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
5ec6e9c 17768 55916 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
5fb4e69 17447 56255 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
636dea4 17378 53709 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
693c788 17921 56337 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
6a05e73 17380 53588 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
6ff44d9 17908 56245 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
7712bea 15739 47437 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
80ba41f 16716 51115 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
82762d4 15218 46461 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
82bbdfb 17207 54151 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
88c7ed1 17575 56067 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
8ee5eeb 17576 54570 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
90dc154 17572 54524 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
91a93da 14580 53548 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
967eee1 17216 56041 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
98568e8 17727 55649 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
a1846cf 16077 48405 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
a24c858 15805 47501 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
a968790 15932 47815 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
ae22e3f 16065 48417 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
af5835f 17598 54764 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
af8b4e9 15717 47116 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
b24c72b 17238 54843 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
b408ef8 15612 46765 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
b5be08a 15661 46806 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
ba11fb9 15668 47059 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
bfdc30c 15951 47915 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
c09f79e 17809 55785 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
c44d230 17749 56412 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
c48efd0 17535 54292 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
c7e4911 16120 48565 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
c91f2f1 15722 47138 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
ca03fdd 17803 55872 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
cbaddd3 17938 56527 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
cd266f3 17299 55706 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
cf31931 17757 55492 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
d18982c 17912 56693 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
d24c9e4 16164 48786 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
d5e86e1 15749 47363 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
de832d7 16998 52093 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
debd040 15524 47397 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
deeb2de 15579 46545 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
e306d51 17556 54510 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
e3d3a17 17594 54690 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
e62bf1a 17349 53582 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
e6ffc62 17609 54672 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
e77fc79 17139 52725 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
e9d8e00 17726 55668 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
efde476 17436 54294 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
f0ea056 16011 48180 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
f1a8d0e 17944 56453 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode
fc1bf2c 15759 47451 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode

The other pull requests, besides those involved in post-merge actions, that contain failing tests with the DedicatedClusterSnapshotRestoreIT class are:

For more details on the failed tests refer to OpenSearch Gradle Check Metrics dashboard.

@reta
Contributor

reta commented Sep 30, 2024

[Catch All Triage - 1, 2, 3, 4]

@cwperks
Member

cwperks commented Mar 20, 2025

Raised a PR to fix a reproducible seed with this failed test: #17389

@prudhvigodithi
Member

I can see that DedicatedClusterSnapshotRestoreIT has become flakier and is failing on multiple PRs. I was able to reproduce it with the following seeds:

./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode" -Dtests.seed=8529B1DD622216C1

./gradlew ':server:internalClusterTest' --tests "org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode" -Dtests.seed=AE568A72925374C5
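
As a rough illustration of why passing -Dtests.seed makes the failure repeatable: a randomized test derives its choices from a seeded generator, so the same seed leads to the same choices on every run. The sketch below uses plain java.util.Random, not the actual randomized testing framework, and the PathType enum and pickPathType helper are hypothetical stand-ins. Whether the real test picks its path type exactly this way is an assumption.

```java
import java.util.Random;

public class SeedReproDemo {
    // Hypothetical stand-in for the two path types named in this thread.
    enum PathType { FIXED, HASHED_PREFIX }

    // Derives a "random" choice purely from the seed, so it is repeatable.
    static PathType pickPathType(long seed) {
        Random random = new Random(seed);
        PathType[] values = PathType.values();
        return values[random.nextInt(values.length)];
    }

    public static void main(String[] args) {
        // Seeds from the commands above, parsed as unsigned 64-bit hex values.
        String[] seeds = { "8529B1DD622216C1", "AE568A72925374C5" };
        for (String hex : seeds) {
            long seed = Long.parseUnsignedLong(hex, 16);
            PathType first = pickPathType(seed);
            PathType second = pickPathType(seed);
            // The same seed always yields the same choice.
            System.out.println(hex + " -> " + first + " (repeatable: " + (first == second) + ")");
        }
    }
}
```

A failure that depends on one of these seed-derived choices would therefore show up only for some seeds, and every time for those seeds.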

@andrross @getsaurabh02

@cwperks
Member

cwperks commented Apr 16, 2025

I have an open PR to address a reproducible seed for this flaky test

@andrross
Member

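I was ultimately able to trace the flakiness down to this line:

OpenSearch/server/src/main/java/org/opensearch/common/blobstore/fs/FsBlobContainer.java

Line 228 in d18982c
Files.createDirectories(path);

which was introduced in #17021. @gbbafna Can you take a look? Why was this Files.createDirectories() call added here?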

@gbbafna gbbafna self-assigned this Apr 18, 2025
@gbbafna
Contributor

gbbafna commented Apr 18, 2025

I was ultimately able to trace the flakiness down to this line:

OpenSearch/server/src/main/java/org/opensearch/common/blobstore/fs/FsBlobContainer.java

Line 228 in d18982c
Files.createDirectories(path);

which was introduced in #17021. @gbbafna Can you take a look? Why was this Files.createDirectories() call added here?

During remote migration we have to delete the directory if the first attempt fails. This is done here. Because the directory is not created automatically for the file system and is only created manually the first time, we had to create it again every time a file is created. I will take a look at the test to see why this causes it to fail.
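
The pattern described above can be sketched roughly as follows. This is a minimal illustration using only java.nio, not the actual FsBlobContainer code. The writeBlob helper, the directory layout, and the simulated cleanup are all hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class EnsureDirBeforeWrite {
    // Hypothetical helper: make sure the container directory exists, then write.
    static void writeBlob(Path dir, String blobName, byte[] data) throws IOException {
        // Does nothing if the directory already exists; recreates it (and any
        // missing parents) if an earlier cleanup deleted it.
        Files.createDirectories(dir);
        Path file = dir.resolve(blobName);
        try (OutputStream out = Files.newOutputStream(file, StandardOpenOption.CREATE_NEW)) {
            out.write(data);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("blobstore-sketch");
        Path dir = root.resolve("indices").resolve("shard-0");

        writeBlob(dir, "blob-1", "first".getBytes(StandardCharsets.UTF_8));

        // Simulate the cleanup after a failed first attempt: the directory is removed.
        Files.delete(dir.resolve("blob-1"));
        Files.delete(dir);

        // Without createDirectories in writeBlob this write would throw
        // NoSuchFileException, because the parent directory is gone.
        writeBlob(dir, "blob-2", "second".getBytes(StandardCharsets.UTF_8));
        System.out.println("blob-2 exists: " + Files.exists(dir.resolve("blob-2")));
    }
}
```

One side effect of this pattern is that any write arriving after a cleanup silently brings the directory back. Whether that is what the test trips over is not established in this thread.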

@gbbafna
Contributor

gbbafna commented Apr 22, 2025

@andrross: Since your change (thanks for doing this), we are not seeing any failures. Would you recommend waiting a few more days before investigating further?

@andrross
Member

@gbbafna I didn't actually fix anything. I just made that test always use RemoteStoreEnums.PathType.FIXED. If any of the other path types are used it will still sometimes fail. It's definitely possible there is still an underlying bug.

@gbbafna
Contributor

gbbafna commented Apr 23, 2025

Ack. Will take a look to make sure we don't have any underlying bug.

@andrross
Member

andrross commented May 2, 2025

FYI @gbbafna @ashking94 We've made HASHED_PREFIX the default in #18163 but this test is explicitly not using that path type because of intermittent failures.

// TODO: There's likely a bug with other path types where cleanup seems to leave unexpected files
.put(BlobStoreRepository.SHARD_PATH_TYPE.getKey(), RemoteStoreEnums.PathType.FIXED)

@andrross andrross reopened this May 2, 2025
@andrross andrross added the disabled-test label May 2, 2025
Projects
Status: 🆕 New
6 participants