Resolve listener passed to WriteActionReplicasProxy#failShardIfNeeded when replica not failed #12181

Closed
wants to merge 1 commit into from


mch2 (Member) commented Feb 6, 2024

Description

On write operations, we do not fail the replica if the primary has been closed. In that case, however, the listener passed to ReplicasProxy is never resolved. This can leave write requests open, because pendingActions is never decremented to 0 inside ReplicationOperation.
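A minimal sketch of the bookkeeping problem described above, not the actual OpenSearch code. The `Operation`, `Listener`, and `failShardIfNeeded` types below are simplified, hypothetical stand-ins, and the `resolveWhenSkipped` flag exists only to compare the two behaviors. Whether the real change resolves the listener as a success or a failure is not stated here; the sketch assumes a plain completion callback.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the pending-action bookkeeping described above.
// None of these types are the real OpenSearch classes.
class PendingActionsSketch {

    /** Minimal callback interface, loosely modeled on an action listener. */
    interface Listener {
        void onResponse();
    }

    /** Tracks outstanding actions and finishes once the count reaches zero. */
    static final class Operation {
        private final AtomicInteger pendingActions = new AtomicInteger();
        private final AtomicBoolean finished = new AtomicBoolean();

        void incPending() {
            pendingActions.incrementAndGet();
        }

        void decPendingAndFinishIfNeeded() {
            if (pendingActions.decrementAndGet() == 0) {
                finished.set(true);
            }
        }

        boolean isFinished() {
            return finished.get();
        }
    }

    /**
     * Hypothetical analogue of failShardIfNeeded. When the primary is closed
     * the replica is not failed. resolveWhenSkipped = false models the bug
     * (listener dropped); true models the proposed behavior (listener resolved).
     */
    static void failShardIfNeeded(boolean primaryClosed, boolean resolveWhenSkipped, Listener listener) {
        if (primaryClosed) {
            if (resolveWhenSkipped) {
                listener.onResponse(); // still notify the caller
            }
            return; // the replica is not failed in either case
        }
        // Pretend the shard-failed request was sent and acknowledged.
        listener.onResponse();
    }

    /** Runs one replica action against a closed primary; returns whether the operation finished. */
    static boolean run(boolean resolveWhenSkipped) {
        Operation op = new Operation();
        op.incPending(); // primary phase
        op.incPending(); // one replica action
        failShardIfNeeded(true, resolveWhenSkipped, op::decPendingAndFinishIfNeeded);
        op.decPendingAndFinishIfNeeded(); // primary phase completes
        return op.isFinished();
    }

    public static void main(String[] args) {
        System.out.println("listener dropped:  finished=" + run(false));
        System.out.println("listener resolved: finished=" + run(true));
    }
}
```

With the listener dropped, the counter stops at 1 and the operation never finishes; resolving the listener on the skipped path lets it reach 0.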

I found this while debugging #12114. The requests left open are start_recovery and retention_lease_sync. retention_lease_sync is a write operation that runs as a step in recovery. In the failing case, the recovery is cancelled and the primary is shut down while a sync is still in flight. The sync hits this condition: the primary shuts down before the request is acknowledged and closed on the node, which leaves both the sync and the original recovery request open.
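The dependency between the two open requests can be illustrated with plain `CompletableFuture`s. This is only an analogy under the assumption that the recovery request completes after the sync does; the class and method names are hypothetical and unrelated to the OpenSearch transport layer.

```java
import java.util.concurrent.CompletableFuture;

// Analogy only: an outer request that depends on an inner one stays open
// for as long as the inner one is never completed.
class OpenRequestChainSketch {

    /** Returns whether the outer (recovery-like) request is done. */
    static boolean recoveryCompletes(boolean syncResolved) {
        // Inner request, standing in for the retention lease sync.
        CompletableFuture<Void> sync = new CompletableFuture<>();
        // Outer request, standing in for start_recovery; it waits on the sync.
        CompletableFuture<String> recovery = sync.thenApply(ignored -> "recovery finished");

        if (syncResolved) {
            sync.complete(null); // the sync's listener is resolved
        }
        // If nobody completes the sync, the recovery future is never done either.
        return recovery.isDone();
    }

    public static void main(String[] args) {
        System.out.println("sync never resolved: recovery done=" + recoveryCompletes(false));
        System.out.println("sync resolved:       recovery done=" + recoveryCompletes(true));
    }
}
```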

Leaving this as a draft for now; I will try to write a more thorough integration test for this case.

Related Issues

Resolves #12114

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

… when replica not failed

Signed-off-by: Marc Handalian <[email protected]>
github-actions bot added the labels bug (Something isn't working), flaky-test (Random test failure that succeeds on second run), and Other on Feb 6, 2024

github-actions bot commented Feb 6, 2024

Compatibility status:

Checks if related components are compatible with change cbe8679

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/performance-analyzer.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/alerting.git]


github-actions bot commented Feb 6, 2024

❕ Gradle check result for cbe8679: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Feb 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.36%. Comparing base (3cbf54e) to head (cbe8679).
Report is 84 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12181      +/-   ##
============================================
- Coverage     71.40%   71.36%   -0.05%     
+ Complexity    59636    59635       -1     
============================================
  Files          4944     4944              
  Lines        280322   280323       +1     
  Branches      40728    40728              
============================================
- Hits         200175   200041     -134     
- Misses        63501    63701     +200     
+ Partials      16646    16581      -65     


opensearch-trigger-bot (Contributor) commented

This PR is stalled because it has been open for 30 days with no activity.

opensearch-trigger-bot bot added and removed the stalled label (Issues that have stalled) on Mar 8, 2024
kotwanikunal (Member) commented

Closing out this stalled draft PR. Please reopen if you are still looking to get this merged.

Development

Successfully merging this pull request may close these issues.

[BUG] Pending tasks not finished on node shutdown causing flaky tests
2 participants