Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint #17216

Merged
merged 2 commits into opensearch-project:main Apr 10, 2025

Conversation

@skumawat2025 (Contributor) commented Jan 31, 2025

Description

In the SegmentReplicationIT.testReplicaAlreadyAtCheckpoint test, we create a three-node cluster: one primary node and two replica nodes. After ingesting documents into the primary shard, the test stops the primary node without first verifying that segment replication to both replicas has completed. This makes the test flaky whenever replication is still in flight when the primary goes down.
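For context, here is a minimal sketch of the flaky sequence inside a SegmentReplicationIT-style test method (the node variables, `INDEX_NAME`, and the doc count are illustrative, not the exact test code):

```java
// Illustrative sketch of the original, flaky flow.
final String primaryNode = internalCluster().startDataOnlyNode();
createIndex(INDEX_NAME); // configured with 1 primary, 2 replicas, SEGMENT replication
final String replicaNode1 = internalCluster().startDataOnlyNode();
final String replicaNode2 = internalCluster().startDataOnlyNode();
ensureGreen(INDEX_NAME);

// Ingest documents into the primary shard.
for (int i = 0; i < 10; i++) {
    client().prepareIndex(INDEX_NAME).setId(Integer.toString(i)).setSource("field", "value" + i).get();
}
refresh(INDEX_NAME);

// BUG: the primary is stopped without waiting for segment replication
// to finish on both replicas, so an in-flight get_segment_files round
// trip can race with the shutdown and fail with a disconnect.
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
```

A representative failure from CI: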

1> org.opensearch.transport.NodeDisconnectedException: [node_t0][127.0.0.1:41957][disconnected] disconnected
  1> [2025-01-15T22:22:47,061][INFO ][o.o.c.c.FollowersChecker ] [node_t0] FollowerChecker{discoveryNode={node_t1}{jpNpHksYQ7-Fb8W2sTp8Sg}{RSZGRaXJQfWjT8jbA4p4iA}{127.0.0.1}{127.0.0.1:42245}{d}{shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
  1> [2025-01-15T22:22:47,057][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t3] [shardId [test-idx-1][0]] [replication id 5] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=1, FILE_DIFF=0, REPLICATING=0}
  1> org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
  1> 	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:154) [main/:?]
  1> 	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
  1> 	at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) [?:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
  1> 	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:84) [main/:?]
  1> 	at org.opensearch.core.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:65) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [main/:?]
  1> 	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) [main/:?]
  1> 	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1505) [main/:?]
  1> 	at org.opensearch.transport.TransportService$8.run(TransportService.java:1357) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:932) [main/:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]
  1> Caused by: org.opensearch.transport.NodeDisconnectedException: [node_t1][127.0.0.1:42245][internal:index/shard/replication/get_segment_files] disconnected
  1> [2025-01-15T22:22:47,057][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t2] [shardId [test-idx-1][0]] [replication id 6] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=1, FILE_DIFF=0, REPLICATING=0}
  1> org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
  1> 	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:154) [main/:?]
  1> 	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
  1> 	at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) [?:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
  1> 	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:84) [main/:?]
  1> 	at org.opensearch.core.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:65) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [main/:?]
  1> 	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) [main/:?]
  1> 	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1505) [main/:?]
  1> 	at org.opensearch.transport.TransportService$8.run(TransportService.java:1357) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:932) [main/:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]
  1> Caused by: org.opensearch.transport.NodeDisconnectedException: [node_t1][127.0.0.1:42245][internal:index/shard/replication/get_segment_files] disconnected
  1> [2025-01-15T22:22:47,061][WARN ][o.o.i.r.OngoingSegmentReplications] [node_t1] Cancelling replications for allocationIds [nVfWi-IOQ06hLbKTnK05VQ]
  1> [2025-01-15T22:22:47,065][WARN ][o.o.c.r.a.AllocationService] [node_t0] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
  1> [2025-01-15T22:22:47,064][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t3] [shardId [test-idx-1][0]] [replication id 7] Replication failed, timing data: {INIT=0, REPLICATING=0}
  1> org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
  1> 	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:154) [main/:?]
  1> 	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [main/:?]
  1> 	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:179) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicator.start(SegmentReplicator.java:137) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicator$ReplicationRunner.doRun(SegmentReplicator.java:123) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:991) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]
  1> Caused by: org.opensearch.transport.NodeNotConnectedException: [node_t1][127.0.0.1:42245] Node not connected
  1> 	at org.opensearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:223) ~[main/:?]
  1> 	at org.opensearch.test.transport.StubbableConnectionManager.getConnection(StubbableConnectionManager.java:93) ~[framework-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.transport.TransportService.getConnection(TransportService.java:898) ~[main/:?]
  1> 	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:857) ~[main/:?]
  1> 	at org.opensearch.indices.replication.PrimaryShardReplicationSource.getCheckpointMetadata(PrimaryShardReplicationSource.java:66) ~[main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:177) ~[main/:?]
  1> 	... 7 more

With this change, we ensure that segment replication has finished on both replicas before stopping the primary node.
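A minimal sketch of the fix, assuming the `waitForSearchableDocs` helper from SegmentReplicationBaseIT (the merged change may phrase the wait differently):

```java
// Fixed sequence: block until both replicas have caught up before
// stopping the primary (illustrative names as in the sketch above).
final long docCount = 10;
// ... ingest docCount documents and refresh(INDEX_NAME), as before ...

// New step: wait until segment replication has completed, i.e. the
// expected documents are searchable on both replica nodes.
waitForSearchableDocs(docCount, replicaNode1, replicaNode2);

// Only now is it safe to stop the primary node.
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
ensureYellowAndNoInitializingShards(INDEX_NAME);
```

Because `waitForSearchableDocs` polls each listed node until the expected documents are searchable there, the replicas should be at (or past) the primary's latest checkpoint before the failover is triggered.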

Related Issues

Resolves #14328

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, flaky-test (Random test failure that succeeds on second run), good first issue (Good for newcomers), Storage (Issues and PRs relating to data and metadata storage), and Storage:Remote labels Jan 31, 2025
@skumawat2025 marked this pull request as ready for review January 31, 2025 08:05
github-actions bot (Contributor)

✅ Gradle check result for eed1d1b: SUCCESS


codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.39%. Comparing base (8182bb0) to head (2fc5a6f).
Report is 64 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17216      +/-   ##
============================================
+ Coverage     72.29%   72.39%   +0.10%     
- Complexity    65900    66014     +114     
============================================
  Files          5350     5350              
  Lines        306185   306208      +23     
  Branches      44373    44375       +2     
============================================
+ Hits         221347   221688     +341     
+ Misses        66670    66362     -308     
+ Partials      18168    18158      -10     


@opensearch-trigger-bot (Contributor)

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot bot added and then removed the stalled (Issues that have stalled) label Mar 21, 2025
Signed-off-by: skumwt <skumwt@amazon.com>

github-actions bot commented Apr 1, 2025

❌ Gradle check result for b73b853: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Apr 1, 2025

❕ Gradle check result for 242ab74: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Sandeep Kumawat <skumwt@amazon.com>
Signed-off-by: skumwt <skumwt@amazon.com>

github-actions bot commented Apr 2, 2025

❌ Gradle check result for 2fc5a6f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-project-automation github-project-automation bot moved this to 👀 In review in Storage Project Board Apr 10, 2025
@ashking94 (Member)

@skumawat2025 Let's get the build to green.

@linuxpi (Contributor) commented Apr 10, 2025

@ashking94 @skumawat2025 Looks like the failure is due to flaky tests. I have retried once.

github-actions bot (Contributor)

✅ Gradle check result for 2fc5a6f: SUCCESS

@ashking94 ashking94 merged commit 967eee1 into opensearch-project:main Apr 10, 2025
31 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Storage Project Board Apr 10, 2025
rgsriram pushed a commit to rgsriram/OpenSearch that referenced this pull request Apr 15, 2025
Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint (opensearch-project#17216)

* Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint

Signed-off-by: skumwt <skumwt@amazon.com>

* Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint

Signed-off-by: Sandeep Kumawat <skumwt@amazon.com>
Signed-off-by: skumwt <skumwt@amazon.com>

---------

Signed-off-by: skumwt <skumwt@amazon.com>
Signed-off-by: Sandeep Kumawat <skumwt@amazon.com>
Co-authored-by: skumwt <skumwt@amazon.com>
Signed-off-by: Sriram Ganesh <srignsh22@gmail.com>
Harsh-87 pushed a commit to Harsh-87/OpenSearch that referenced this pull request May 7, 2025
Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint (opensearch-project#17216)

* Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint

Signed-off-by: skumwt <skumwt@amazon.com>

* Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint

Signed-off-by: Sandeep Kumawat <skumwt@amazon.com>
Signed-off-by: skumwt <skumwt@amazon.com>

---------

Signed-off-by: skumwt <skumwt@amazon.com>
Signed-off-by: Sandeep Kumawat <skumwt@amazon.com>
Co-authored-by: skumwt <skumwt@amazon.com>
Signed-off-by: Harsh Kothari <techarsh@amazon.com>
Labels
autocut, flaky-test (Random test failure that succeeds on second run), good first issue (Good for newcomers), skip-changelog, Storage:Remote, Storage (Issues and PRs relating to data and metadata storage), >test-failure (Test failure from CI, local build, etc.)
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SegmentReplicationIT
3 participants