
[META] Eliminate flakiness in :server:internalClusterTest #18108

Open
andrross opened this issue Apr 28, 2025 · 11 comments
Labels
flaky-test Random test failure that succeeds on second run Meta Meta issue, not directly linked to a PR :test Adding or fixing a test

Comments

@andrross
Member

andrross commented Apr 28, 2025

Please describe the end goal of this project

We have no shortage of issues related to flaky tests. I'm creating this new issue with a narrower focus (only :server:internalClusterTest) and a specific goal (eliminate current flakiness). The intent is to get the most problematic flakiness back under control to make merging PRs a less miserable experience, while we continue to iterate on new mechanisms in #17974.

Supporting References

Flakiness of :server:internalClusterTest is simple to measure, and I'll continue posting updates on this issue to track progress.

Test Environment:

  • OS: Ubuntu 24.04.2 LTS
  • Host type: m8g.4xlarge (EC2)
  • JDK: Temurin-21.0.5+11

Test procedure:

% export RESULT_DIR=~/test-results-$(date +"%Y-%m-%d")-$(git rev-parse --verify HEAD --short=8)
% mkdir -p "$RESULT_DIR"
% for i in $(seq 1 100) ; do ./gradlew ':server:internalClusterTest' 2> "$RESULT_DIR/server_internalClusterTest-$(date +"%Y-%m-%d_%H-%M-%S")" ; done

Count failures:

tail -n1 $RESULT_DIR/* | grep FAILED  | wc -l

Count number of runs:

ls $RESULT_DIR | wc -l

Display test failures by count:

grep '^REPROD' $RESULT_DIR/* | cut -d ' ' -f6 | sort | uniq -c | sort -rn
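
The two counts above can be combined into the success-rate figure reported in the updates below. This is a minimal sketch, not the script actually used for those numbers: the function name `summarize_results` is hypothetical, and it assumes, as the procedure above does, that the last line of a failed run's captured log contains "FAILED".

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<pct>% (<passed> out of <total>)" for a
# directory holding one captured log file per run.
# Assumption: a run failed if the last line of its log contains "FAILED".
summarize_results() {
  local dir="$1"
  local total=0
  local failed=0
  local f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    total=$((total + 1))
    if tail -n1 "$f" | grep -q 'FAILED'; then
      failed=$((failed + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "no runs found in $dir" >&2
    return 1
  fi
  local passed=$((total - failed))
  # Integer percentage, rounded to the nearest whole number.
  local pct=$(( (passed * 100 + total / 2) / total ))
  echo "${pct}% (${passed} out of ${total})"
}
```

Usage would be `summarize_results "$RESULT_DIR"` once the loop has finished or been interrupted.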

Issues

Related to #17974

Related component

Build

@andrross andrross added Meta Meta issue, not directly linked to a PR untriaged labels Apr 28, 2025
@andrross
Member Author

andrross commented Apr 28, 2025

Results for April 24 (c5e55b0)

Success rate:

34% (44 out of 128)

Failed tests

% grep '^REPROD' $RESULT_DIR/* | cut -d ' ' -f6 | sort | uniq -c | sort -rn
     60 "org.opensearch.search.simple.SimpleSearchIT"
     18 "org.opensearch.recovery.RecoveryWhileUnderLoadIT"
     16 "org.opensearch.repositories.RepositoriesServiceIT.testCreatSnapAndUpdateReposityCauseInfiniteLoop"
     14 "org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock"
      9 "org.opensearch.gateway.remote.RemoteStatePublicationIT.testRemotePublicationDownloadStats"
      6 "org.opensearch.remotestore.RemoteStorePinnedTimestampsGarbageCollectionIT.testIndexDeletionWithPinnedTimestamps"
      6 "org.opensearch.indices.settings.SearchOnlyReplicaIT.testFailoverWithSearchReplicaWhenSearchNodeRestarts"
      5 "org.opensearch.remotestore.RemoteStorePinnedTimestampsGarbageCollectionIT.testLiveIndexNoPinnedTimestampsWithExtraGenSetting"
      5 "org.opensearch.remotestore.RemoteStorePinnedTimestampsGarbageCollectionIT.testIndexDeletionNoPinnedTimestamps"
      5 "org.opensearch.index.ClusterMaxMergesAtOnceIT.testClusterLevelDefaultUpdatesMergePolicy"
      4 "org.opensearch.remotestore.RemoteStorePinnedTimestampsGarbageCollectionIT.testLiveIndexWithPinnedTimestamps"
      4 "org.opensearch.indices.replication.WarmIndexSegmentReplicationIT.testNodeDropWithOngoingReplication"
      4 "org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently"
      2 "org.opensearch.wlm.WorkloadManagementIT"
      2 "org.opensearch.remotestore.RemoteStorePinnedTimestampsGarbageCollectionIT.testLiveIndexNoPinnedTimestamps"
      2 "org.opensearch.recovery.RelocationIT"
      2 "org.opensearch.indices.replication.WarmIndexSegmentReplicationIT"
      2 "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureFeatureEnabledDisabledSetting"
      2 "org.opensearch.discovery.ClusterDisruptionIT.testAckedIndexing"
      2 "org.opensearch.cluster.routing.WeightedRoutingIT.testClusterHealthResponseWithEnsureNodeWeighedInParam"
      2 "org.opensearch.action.admin.cluster.stats.ClusterStatsIT.testClusterStatsWithMappingsAndAnalysisStatsIndexMetricsFilter"
      1 "org.opensearch.remotemigration.RemoteMigrationIndexMetadataUpdateIT.testIndexSettingsUpdatedEvenForMisconfiguredReplicas"
      1 "org.opensearch.indices.stats.IndexStatsIT"
      1 "org.opensearch.indices.settings.UpdateNumberOfReplicasIT.testUpdateWithInvalidNumberOfReplicas"
      1 "org.opensearch.indices.settings.UpdateNumberOfReplicasIT.testSimpleUpdateNumberOfReplicas"
      1 "org.opensearch.indices.settings.UpdateNumberOfReplicasIT.testAwarenessReplicaBalanceWithUseZoneForDefaultReplicaCount"
      1 "org.opensearch.indices.settings.UpdateNumberOfReplicasIT.testAwarenessReplicaBalance"
      1 "org.opensearch.indices.settings.UpdateNumberOfReplicasIT.testAutoExpandNumberReplicas2"
      1 "org.opensearch.indices.replication.WarmIndexSegmentReplicationIT.testPrimaryReceivesDocsDuringReplicaRecovery"
      1 "org.opensearch.indices.replication.WarmIndexSegmentReplicationIT.testDropPrimaryDuringReplication"
      1 "org.opensearch.indices.replication.SegmentReplicationStatsIT.testSegmentReplicationNodeAndIndexStats"
      1 "org.opensearch.indices.replication.SegmentReplicationResizeRequestIT.testCreateShrinkIndexThrowsExceptionWhenReplicasBehind"
      1 "org.opensearch.indices.IndicesRequestCacheCleanupIT.testCacheCleanupWithDefaultSettings"
      1 "org.opensearch.indexing.IndexActionIT"
      1 "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites"
      1 "org.opensearch.discovery.ClusterManagerDisruptionIT.testIsolateClusterManagerAndVerifyClusterStateConsensus"
      1 "org.opensearch.action.admin.cluster.node.tasks.ConcurrentSearchTasksIT.testConcurrentSearchTaskTracking"
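
The listing above mixes class-level entries with per-method entries, so a class that is flaky across many methods is easy to overlook: the five RemoteStorePinnedTimestampsGarbageCollectionIT methods add up to 22 failures between them. A sketch of rolling the counts up per class follows. The function name `failures_by_class` is hypothetical, and it assumes that a final name segment starting with a lowercase letter is a test method.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `uniq -c` style lines (count, quoted test name)
# on stdin and print per-class totals, highest count first.
# Assumption: a last dotted segment starting with a lowercase letter is a
# method name; otherwise the whole name is already a class.
failures_by_class() {
  awk 'NF < 2 { next }
  {
    count = $1
    name = $2
    gsub(/"/, "", name)
    n = split(name, parts, ".")
    if (n > 1 && parts[n] ~ /^[a-z]/) {
      sub(/\.[^.]*$/, "", name)   # drop the trailing ".methodName"
    }
    totals[name] += count
  }
  END {
    for (name in totals) printf "%d %s\n", totals[name], name
  }' | sort -k1,1nr -k2,2
}
```

It can be appended to the existing pipeline in place of the final sort: `grep '^REPROD' $RESULT_DIR/* | cut -d ' ' -f6 | sort | uniq -c | failures_by_class`.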


@andrross andrross added flaky-test Random test failure that succeeds on second run :test Adding or fixing a test and removed untriaged labels Apr 28, 2025
@andrross
Member Author

Results for April 30 (0795bb2)

Success rate:

70% (30 out of 43)

Failed tests

      8 "org.opensearch.gateway.remote.RemoteStatePublicationIT.testRemotePublicationDownloadStats"
      4 "org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently"
      3 "org.opensearch.indices.replication.WarmIndexSegmentReplicationIT"
      2 "org.opensearch.indices.replication.WarmIndexSegmentReplicationIT.testDropPrimaryDuringReplication"
      1 "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites"
      1 "org.opensearch.discovery.ClusterDisruptionIT.testAckedIndexing"
      1 "org.opensearch.cluster.routing.WeightedRoutingIT.testClusterHealthResponseWithEnsureNodeWeighedInParam"
      1 "org.opensearch.action.bulk.BulkIntegrationIT.testDeleteIndexWhileIndexing"
      1 "org.opensearch.action.admin.cluster.state.TransportClusterStateActionDisruptionIT.testLocalRequestAlwaysSucceeds"
      1 "org.opensearch.action.admin.cluster.node.tasks.ConcurrentSearchTasksIT.testConcurrentSearchTaskTracking"


@andrross
Member Author

andrross commented May 1, 2025

Results for May 1 (0cbd848)

Success rate:

92% (34 out of 37, progress!)

Failed tests

      1 "org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards"
      1 "org.opensearch.gateway.remote.RemoteStatePublicationIT.testRemotePublicationSettingChangePersistedAfterRestart"
      1 "org.opensearch.discovery.ClusterDisruptionIT.testAckedIndexing"
      1 "org.opensearch.cluster.routing.WeightedRoutingIT.testClusterHealthResponseWithEnsureNodeWeighedInParam"


@andrross
Member Author

andrross commented May 7, 2025

Results for May 6 (c32262d)

Success rate:

75% (30 out of 40)

Failed tests

      6 "org.opensearch.search.simple.SimpleSearchIT"
      3 "org.opensearch.repositories.RepositoriesServiceIT.testCreatSnapAndUpdateReposityCauseInfiniteLoop"
      2 "org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently"
      1 "org.opensearch.remotemigration.RemoteMigrationIndexMetadataUpdateIT.testIndexSettingsUpdatedOnlyForMigratingIndex"
      1 "org.opensearch.indexing.IndexActionIT"
      1 "org.opensearch.cluster.routing.WeightedRoutingIT.testClusterHealthResponseWithEnsureNodeWeighedInParam"
      1 "org.opensearch.action.admin.cluster.stats.ClusterStatsIT.testClusterStatsWithMappingsAndAnalysisStatsIndexMetricsFilter"


@andrross
Member Author

Results for May 14 (998ae73)

Success rate:

71% (27 out of 38)

Failed tests

      9 "org.opensearch.wlm.WorkloadManagementIT"
      2 "org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently"
      1 "org.opensearch.indices.recovery.IndexRecoveryIT.testRerouteRecovery"
      1 "org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites"
      1 "org.opensearch.cluster.routing.WeightedRoutingIT.testClusterHealthResponseWithEnsureNodeWeighedInParam"


@andrross
Member Author

Results for May 15 (20d56d2)

Success rate:

72% (31 of 43)

Failed tests

      8 "org.opensearch.wlm.WorkloadManagementIT"
      2 "org.opensearch.recovery.RecoveryWhileUnderLoadIT"
      1 "org.opensearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards"
      1 "org.opensearch.indices.replication.SegmentReplicationIT.testNodeDropWithOngoingReplication"
      1 "org.opensearch.index.ShardIndexingPressureSettingsIT.testShardIndexingPressureFeatureEnabledDisabledSetting"
      1 "org.opensearch.action.admin.cluster.state.TransportClusterStateActionDisruptionIT.testNonLocalRequestAlwaysFindsClusterManagerAndWaitsForMetadata"


@Divyaasm
Contributor

Hey @andrross, I've implemented the same setup. Looks like the runs will take several hours. I will automate this through a Jenkins job that reports the list of failed tests after all iterations, once the job runs successfully. Eventually we can ingest the data into the metrics cluster. I will tag the related issues once they are created.
Thanks!

@andrross
Member Author

Looks like the runs will take several hours

@Divyaasm A successful run of :server:internalClusterTest on a 16 core machine takes roughly 30 minutes in my testing. Note that the count of 100 iterations in the bash for loop above is arbitrary, and I've never actually run 100 iterations because that would take multiple days.
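
The arithmetic behind that estimate, as a rough sketch: 100 sequential runs at about 30 minutes each come to roughly 50 hours. The function name `estimate_hours` is hypothetical, and it assumes every run takes about as long as a successful one, which failed runs may not.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wall-clock hours for N sequential runs, rounded up.
# Assumption: every run takes about minutes_per_run (default 30, the figure
# quoted above for a successful run on a 16 core machine).
estimate_hours() {
  local iterations="$1"
  local minutes_per_run="${2:-30}"
  echo $(( (iterations * minutes_per_run + 59) / 60 ))
}
```

For example, `estimate_hours 100` gives the figure for the full loop, and `estimate_hours 40` gives it for a batch closer to the sizes reported in this thread.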

@andrross
Member Author

andrross commented May 22, 2025

Results for May 21 (d723db8)

Success rate:

68% (25 of 37)

Failed tests

     10 "org.opensearch.wlm.WorkloadManagementIT"
      2 "org.opensearch.action.admin.indices.create.CreateIndexIT.testCreateAndDeleteIndexConcurrently"
      1 "org.opensearch.cluster.metadata.AutoExpandSearchReplicasIT.testAutoExpandSearchReplica"


@Divyaasm
Contributor

Yes, my estimate is that it could take a couple of days. We can make the number of iterations configurable so that it can be set as needed before we trigger the job.
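
One way the configurable iteration count could look, as a sketch only and not the actual Jenkins job: the function name `run_iterations` and the `ITERATIONS` variable are hypothetical, and the command to run is passed as arguments so the same loop can wrap any Gradle invocation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command a configurable number of times,
# capturing each run's stderr in its own file under result_dir.
# Usage: run_iterations <iterations> <result_dir> <command> [args...]
run_iterations() {
  local iterations="$1"
  local result_dir="$2"
  shift 2
  mkdir -p "$result_dir"
  local i=1
  while [ "$i" -le "$iterations" ]; do
    # The zero-padded index keeps file names unique even if two runs
    # finish within the same second.
    "$@" 2> "$result_dir/run-$(printf '%03d' "$i")-$(date +"%Y-%m-%d_%H-%M-%S")" || true
    i=$((i + 1))
  done
}
```

A job could then call `run_iterations "${ITERATIONS:-10}" "$RESULT_DIR" ./gradlew ':server:internalClusterTest'`, with `ITERATIONS` supplied as a job parameter. The `|| true` keeps the loop going after a failed run, which is the point of the exercise.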

@prudhvigodithi
Member

Should we test with the -Dtests.iters=N option? For example: ./gradlew ':server:internalClusterTest' -Dtests.iters=100. In my past tests (https://github.com/prudhvigodithi/automations/tree/main/opensearch-gradle-check#using-gradle-options), -Dtests.iters was much faster than running in a loop, because it reuses the existing Gradle cache and Gradle daemon.

Projects
Status: New
Development

No branches or pull requests

3 participants