Fix flaky Segment Replication test testStartReplicaAfterPrimaryIndexesDocs. #5722

Merged
dreamer-89 merged 2 commits into opensearch-project:main from sr-validation on Jan 7, 2023

Conversation

mch2
Member

@mch2 mch2 commented Jan 6, 2023

Signed-off-by: Marc Handalian [email protected]

Description

Fix flaky SegmentReplicationIT test testStartReplicaAfterPrimaryIndexesDocs.
This test was failing because, post recovery, we validate whether a shard is able to perform segment replication (segrep) while also validating a passed-in checkpoint. In the post-recovery scenario this checkpoint is passed as empty, yet once docs have been indexed the shard is ahead of the empty checkpoint, so validation fails. This change separates shard validation from checkpoint validation and performs only the former post recovery.
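
To illustrate the split, here is a minimal sketch, not the actual IndexShard code. Apart from the two method names `isSegmentReplicationAllowed` and `shouldProcessCheckpoint`, which appear in this PR, every type and field below (`SketchShard`, `SketchCheckpoint`, `SketchShardState`, the single `version` number) is a hypothetical simplification. It shows why an empty checkpoint fails checkpoint validation on a shard that has already indexed docs, while the shard-level check still passes.

```java
// Simplified, hypothetical stand-ins; none of these are the real OpenSearch classes.
enum SketchShardState { POST_RECOVERY, STARTED, CLOSED }

final class SketchCheckpoint {
    final long version; // assumption: a checkpoint reduced to one version number

    SketchCheckpoint(long version) {
        this.version = version;
    }

    // Stand-in for the empty checkpoint that is passed in after recovery.
    static SketchCheckpoint empty() {
        return new SketchCheckpoint(0L);
    }

    boolean isAheadOf(SketchCheckpoint other) {
        return version > other.version;
    }
}

final class SketchShard {
    private final SketchShardState state;
    private final SketchCheckpoint local;

    SketchShard(SketchShardState state, SketchCheckpoint local) {
        this.state = state;
        this.local = local;
    }

    // Shard-level validation: looks only at the shard, never at a checkpoint.
    boolean isSegmentReplicationAllowed() {
        return state == SketchShardState.STARTED || state == SketchShardState.POST_RECOVERY;
    }

    // Checkpoint-level validation: runs the shard check first, then compares checkpoints.
    boolean shouldProcessCheckpoint(SketchCheckpoint received) {
        if (!isSegmentReplicationAllowed()) {
            return false;
        }
        // Only a checkpoint newer than the local one is worth processing.
        return received.isAheadOf(local);
    }
}

final class CheckpointValidationDemo {
    public static void main(String[] args) {
        // A recovered shard that has already indexed docs (local version 5).
        SketchShard shard = new SketchShard(SketchShardState.POST_RECOVERY, new SketchCheckpoint(5L));
        // The combined check rejects the empty checkpoint because the shard is ahead of it.
        System.out.println(shard.shouldProcessCheckpoint(SketchCheckpoint.empty()));
        // The shard-only check is what the post-recovery path needs.
        System.out.println(shard.isSegmentReplicationAllowed());
    }
}
```

In this sketch the post-recovery caller would use only `isSegmentReplicationAllowed()`, leaving `shouldProcessCheckpoint` for checkpoints that actually arrive from elsewhere.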

This PR also introduces validation of the engine type before segment replication (SR) is invoked, to ensure that NRTReplicationEngine is properly loaded on the replica. Without this check, SR would continue and blow up at a later stage with an index corruption error. This happens a lot when MockInternalEngine is randomly loaded in tests, as this method by default returns true.
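
The idea of failing fast on the wrong engine type can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `SketchEngine`, `SketchNrtEngine`, `SketchMockEngine`, `EngineGuardedShard` and `replicationEngine()` are hypothetical names standing in for the real engine classes and accessor.

```java
import java.util.Optional;

// Hypothetical stand-ins for the engine hierarchy; not the real OpenSearch types.
interface SketchEngine {}

final class SketchNrtEngine implements SketchEngine {}   // plays the role of NRTReplicationEngine

final class SketchMockEngine implements SketchEngine {}  // plays the role of a randomly loaded mock engine

final class EngineGuardedShard {
    private final SketchEngine engine;

    EngineGuardedShard(SketchEngine engine) {
        this.engine = engine;
    }

    // Returns the engine only when it is the replication-capable type.
    Optional<SketchNrtEngine> replicationEngine() {
        return engine instanceof SketchNrtEngine
            ? Optional.of((SketchNrtEngine) engine)
            : Optional.empty();
    }

    // Reject replication up front instead of failing later with a harder-to-diagnose error.
    boolean isSegmentReplicationAllowed() {
        return replicationEngine().isPresent();
    }
}

final class EngineValidationDemo {
    public static void main(String[] args) {
        System.out.println(new EngineGuardedShard(new SketchNrtEngine()).isSegmentReplicationAllowed());
        System.out.println(new EngineGuardedShard(new SketchMockEngine()).isSegmentReplicationAllowed());
    }
}
```

Checking the type before replication starts keeps the failure close to its cause, which is the motivation given above.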

I've re-run this test 1,000 times in IntelliJ without a failure.

Issues Resolved

related #5669

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Jan 6, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.client.RestClientMultipleHostsIntegTests.testNodeSelector

@codecov-commenter

codecov-commenter commented Jan 6, 2023

Codecov Report

Merging #5722 (f8e001f) into main (b3e25bb) will increase coverage by 0.04%.
The diff coverage is 81.25%.

@@             Coverage Diff              @@
##               main    #5722      +/-   ##
============================================
+ Coverage     71.05%   71.10%   +0.04%     
+ Complexity    58744    58741       -3     
============================================
  Files          4766     4766              
  Lines        280030   280030              
  Branches      40434    40434              
============================================
+ Hits         198988   199117     +129     
+ Misses        64851    64679     -172     
- Partials      16191    16234      +43     
Impacted Files Coverage Δ
...ch/indices/cluster/IndicesClusterStateService.java 69.66% <0.00%> (+1.02%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java 71.16% <86.66%> (+0.96%) ⬆️
...n/indices/forcemerge/ForceMergeRequestBuilder.java 0.00% <0.00%> (-75.00%) ⬇️
.../indices/forcemerge/TransportForceMergeAction.java 25.00% <0.00%> (-75.00%) ⬇️
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) ⬇️
...ava/org/opensearch/action/NoSuchNodeException.java 0.00% <0.00%> (-50.00%) ⬇️
...opensearch/persistent/PersistentTasksExecutor.java 22.22% <0.00%> (-44.45%) ⬇️
...adcast/BroadcastShardOperationFailedException.java 55.55% <0.00%> (-44.45%) ⬇️
.../admin/cluster/reroute/ClusterRerouteResponse.java 60.00% <0.00%> (-40.00%) ⬇️
...luster/routing/allocation/RoutingExplanations.java 62.06% <0.00%> (-37.94%) ⬇️
... and 493 more

@mch2 mch2 requested review from gbbafna and VachaShah as code owners January 6, 2023 18:28
public final boolean shouldProcessCheckpoint(ReplicationCheckpoint requestCheckpoint) {
if (state().equals(IndexShardState.STARTED) == false) {
logger.trace(() -> new ParameterizedMessage("Ignoring new replication checkpoint - shard is not started {}", state()));
public boolean isSegmentReplicationAllowed() {
Member

nit: isSegmentReplicationAllowed does not sound like it is meant for the target/replica. Maybe isSegRepSyncAllowed or isSegRepAllowedOnReplica?

Member Author

This method is not invoked only for replicas; even after a primary is recovered, it is invoked from IndicesClusterStateService#forceSegmentReplication.

*
* @param requestCheckpoint received checkpoint that is checked for processing
* @return true if checkpoint should be processed
* Checks if this shard is able to perform segment replication.
Member

nit:

  • Checks if this shard is able to perform segment replication.

Checks if this target shard should start a round of segment replication with the primary?

Member Author

Updated - I'm intentionally not mentioning primary here as these paths will be reused for remote store replication.

@dreamer-89
Member

Gradle Check (Jenkins) Run Completed with:

Does not seem related, refiring the gradle check.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.action.support.replication.TransportReplicationActionTests.testClosedIndexOnReroute" -Dtests.seed=353F0B5144D33354 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=fr-LU -Dtests.timezone=America/Argentina/San_Juan -Druntime.java=19

org.opensearch.action.support.replication.TransportReplicationActionTests > testClosedIndexOnReroute FAILED
    java.lang.IllegalStateException: No local node found. Is the node started?
        at __randomizedtesting.SeedInfo.seed([353F0B5144D33354:59C1A9D22741BA1F]:0)
        at org.opensearch.cluster.service.ClusterService.localNode(ClusterService.java:156)
        at org.opensearch.action.support.replication.TransportReplicationAction$ReroutePhase.<init>(TransportReplicationAction.java:890)
        at org.opensearch.action.support.replication.TransportReplicationAction$ReroutePhase.<init>(TransportReplicationAction.java:883)
        at org.opensearch.action.support.replication.TransportReplicationActionTests.testClosedIndexOnReroute(TransportReplicationActionTests.java:640)

@dreamer-89
Member

Gradle Check (Jenkins) Run Completed with:

This is fixed in #5737. @mch2: Can you please rebase your changes against main?

mch2 added 2 commits January 6, 2023 15:03
This test was failing because we are validating post recovery if a shard is able to perform segrep while also performing validation of a passed-in checkpoint. In the post recovery test this checkpoint is always empty, yet the shard will be ahead of this checkpoint after docs are indexed. This change differentiates shard validation from checkpoint validation.

Signed-off-by: Marc Handalian <[email protected]>

Fix spotless.

Signed-off-by: Marc Handalian <[email protected]>

Fix testIsSegmentReplicationAllowed_WrongEngineType.

Signed-off-by: Marc Handalian <[email protected]>

Update warn logs in isSegmentReplicationAllowed.

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
@github-actions
Contributor

github-actions bot commented Jan 6, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testRestartPrimary

@dreamer-89 dreamer-89 merged commit 85f4149 into opensearch-project:main Jan 7, 2023
@mch2 mch2 deleted the sr-validation branch January 7, 2023 00:09
mch2 added a commit to mch2/OpenSearch that referenced this pull request Jan 19, 2023
…sDocs. (opensearch-project#5722)

* Fix flaky SR test testStartReplicaAfterPrimaryIndexesDocs.

This test was failing because we are validating post recovery if a shard is able to perform segrep while also performing validation of a passed-in checkpoint. In the post recovery test this checkpoint is always empty, yet the shard will be ahead of this checkpoint after docs are indexed. This change differentiates shard validation from checkpoint validation.

Signed-off-by: Marc Handalian <[email protected]>

Fix spotless.

Signed-off-by: Marc Handalian <[email protected]>

Fix testIsSegmentReplicationAllowed_WrongEngineType.

Signed-off-by: Marc Handalian <[email protected]>

Update warn logs in isSegmentReplicationAllowed.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>
dreamer-89 added a commit that referenced this pull request Jan 20, 2023
….x. (#5945)

* [Segment Replication] Add snapshot and restore tests for segment replication feature (#3993)

* [Segment Replication] Add snapshots tests with segment replication enabled

Signed-off-by: Suraj Singh <[email protected]>

* Fix spotless failures

Signed-off-by: Suraj Singh <[email protected]>

* Add changelog entry, address review comments, add failover test

Signed-off-by: Suraj Singh <[email protected]>

* Fix spotless failures

Signed-off-by: Suraj Singh <[email protected]>

* Address review comments 2

Signed-off-by: Suraj Singh <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>

* Remove changelog update.

Signed-off-by: Marc Handalian <[email protected]>

* Mute flaky test testStartReplicaAfterPrimaryIndexesDocs. (#5714)

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>

* Fix flaky Segment Replication test testStartReplicaAfterPrimaryIndexesDocs. (#5722)

* Fix flaky SR test testStartReplicaAfterPrimaryIndexesDocs.

This test was failing because we are validating post recovery if a shard is able to perform segrep while also performing validation of a passed-in checkpoint. In the post recovery test this checkpoint is always empty, yet the shard will be ahead of this checkpoint after docs are indexed. This change differentiates shard validation from checkpoint validation.

Signed-off-by: Marc Handalian <[email protected]>

Fix spotless.

Signed-off-by: Marc Handalian <[email protected]>

Fix testIsSegmentReplicationAllowed_WrongEngineType.

Signed-off-by: Marc Handalian <[email protected]>

Update warn logs in isSegmentReplicationAllowed.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>

* [Segment Replication] Mute flaky tests (#5739)

Signed-off-by: Suraj Singh <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>

* [Segment Replication] Mute flaky tests (#5742)

Signed-off-by: Suraj Singh <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>

* Fix spotless.

Signed-off-by: Marc Handalian <[email protected]>

* Muting flaky SegmentReplication ITs. (#5700)

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Co-authored-by: Suraj Singh <[email protected]>
kotwanikunal pushed a commit that referenced this pull request Jan 25, 2023
….x. (#5945)

* [Segment Replication] Add snapshot and restore tests for segment replication feature (#3993)

* [Segment Replication] Add snapshots tests with segment replication enabled

Signed-off-by: Suraj Singh <[email protected]>

* Fix spotless failures

Signed-off-by: Suraj Singh <[email protected]>

* Add changelog entry, address review comments, add failover test

Signed-off-by: Suraj Singh <[email protected]>

* Fix spotless failures

Signed-off-by: Suraj Singh <[email protected]>

* Address review comments 2

Signed-off-by: Suraj Singh <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>

* Remove changelog update.

Signed-off-by: Marc Handalian <[email protected]>

* Mute flaky test testStartReplicaAfterPrimaryIndexesDocs. (#5714)

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>

* Fix flaky Segment Replication test testStartReplicaAfterPrimaryIndexesDocs. (#5722)

* Fix flaky SR test testStartReplicaAfterPrimaryIndexesDocs.

This test was failing because we are validating post recovery if a shard is able to perform segrep while also performing validation if a passed in checkopint.  In the post recovery test this checkpoint is always empty, yet the shard will be ahead of this checkpoint after docs are indexed.  This change differentiates shard validation from checkpoint validation.

Signed-off-by: Marc Handalian <[email protected]>

Fix spotless.

Signed-off-by: Marc Handalian <[email protected]>

Fix testIsSegmentReplicationAllowed_WrongEngineType.

Signed-off-by: Marc Handalian <[email protected]>

Update warn logs in isSegmentReplicationAllowed.

Signed-off-by: Marc Handalian <[email protected]>

* PR feedback.

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>

* [Segment Replication] Mute flaky tests (#5739)

Signed-off-by: Suraj Singh <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>

* [Segment Replication] Mute flaky tests (#5742)

Signed-off-by: Suraj Singh <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>

* Fix spotless.

Signed-off-by: Marc Handalian <[email protected]>

* Muting flaky SegmentReplication ITs. (#5700)

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Marc Handalian <[email protected]>

Signed-off-by: Suraj Singh <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Co-authored-by: Suraj Singh <[email protected]>