Fix bugs causing red indexes with remote indexes during translog upload & store recovery #10449

Merged: 5 commits, Oct 7, 2023

@ashking94 ashking94 commented Oct 6, 2023

Description

This PR fixes two issues that caused red indexes during my stress testing:

  1. [BUG] Store recovery keeps failing for remote indexes if the remote store interaction fails and the shard is red (TranslogCorruptedException) #10400
  2. [BUG] Shard failure due to fail engine during trimUnreferencedTranslogFiles in InternalEngine for Remote indexes #10398

Related Issues

Resolves #10398, #10400.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented Oct 6, 2023

Compatibility status:

Checks if related components are compatible with change 97e5b34

Incompatible components

Incompatible components: [https://github.com/opensearch-project/security-analytics.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git]

@ashking94 ashking94 changed the title from "Disable async trim translog task for remote indexes" to "Bug fixes for red indexes with remote indexes" Oct 6, 2023
@ashking94 ashking94 changed the title from "Bug fixes for red indexes with remote indexes" to "Fixes bugs causing red indexes with remote indexes during translog upload" Oct 6, 2023
@ashking94
Member Author

These issues were found during stress testing, and it's a bit tricky to simulate these kinds of failures in integ tests. I'm still exploring whether there is a way to write an IT for the above fixes.

@github-actions github-actions bot added the bug (Something isn't working) and Storage:Durability (Issues and PRs related to the durability framework) labels Oct 6, 2023
@ashking94 ashking94 marked this pull request as ready for review October 6, 2023 11:56
@gbbafna gbbafna added the backport 2.x (Backport to 2.x branch) label Oct 7, 2023
@sachinpkale
Member

Please add ITs around the changes.

github-actions bot commented Oct 7, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testSearchAggregationWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards

codecov bot commented Oct 7, 2023

Codecov Report

Merging #10449 (97e5b34) into main (2bf63c9) will increase coverage by 0.08%.
Report is 3 commits behind head on main.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##               main   #10449      +/-   ##
============================================
+ Coverage     71.17%   71.26%   +0.08%     
- Complexity    58371    58425      +54     
============================================
  Files          4843     4843              
  Lines        275264   275266       +2     
  Branches      40076    40076              
============================================
+ Hits         195928   196172     +244     
+ Misses        62882    62663     -219     
+ Partials      16454    16431      -23     
Files Coverage Δ
...c/main/java/org/opensearch/index/IndexService.java 75.86% <ø> (+0.43%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java 69.63% <0.00%> (-0.13%) ⬇️

... and 492 files with indirect coverage changes

Signed-off-by: Ashish Singh <[email protected]>
@ashking94
Member Author

Please add ITs around the changes.

@sachinpkale Done. Pls take a relook. Ty!

ashking94 commented Oct 7, 2023

Gradle Check (Jenkins) Run Completed with:

Failures -

org.opensearch.script.expression.MoreExpressionIT.testSpecialValueVariable {p0={"search.concurrent_segment_search.enabled":"true"}}
org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex

Of the above, #10079 is an already-reported flaky test.
As for the testAsyncTranslogTrimTaskOnClosedIndex failure, I have made no changes that should affect that test: the change here applies only to remote indexes and does not touch IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex. I am running the test on main to see whether it fails there and after how many iterations. I have already run it for 250+ iterations with this change and have not seen the failure yet.

Signed-off-by: Ashish Singh <[email protected]>
@ashking94 ashking94 changed the title from "Fixes bugs causing red indexes with remote indexes during translog upload & store recovery" to "Fix bugs causing red indexes with remote indexes during translog upload & store recovery" Oct 7, 2023
ashking94 commented Oct 7, 2023

Gradle Check (Jenkins) Run Completed with:

Flaky tests - #10154, #10046

Signed-off-by: Ashish Singh <[email protected]>
@sachinpkale sachinpkale merged commit 8bb11a6 into opensearch-project:main Oct 7, 2023
15 of 16 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 7, 2023
…ad & store recovery (#10449)

---------

Signed-off-by: Ashish Singh <[email protected]>
(cherry picked from commit 8bb11a6)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 7, 2023
…ad & store recovery (#10449)

---------

Signed-off-by: Ashish Singh <[email protected]>
(cherry picked from commit 8bb11a6)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Oct 9, 2023
…ad & store recovery (#10449) (#10498)

---------


(cherry picked from commit 8bb11a6)

Signed-off-by: Ashish Singh <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Oct 9, 2023
…ad & store recovery (#10449) (#10497)

---------


(cherry picked from commit 8bb11a6)

Signed-off-by: Ashish Singh <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request Oct 9, 2023
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Oct 23, 2023
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ad & store recovery (opensearch-project#10449)

---------

Signed-off-by: Ashish Singh <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Labels: backport 2.x (Backport to 2.x branch), backport 2.11, bug (Something isn't working), skip-changelog, Storage:Durability (Issues and PRs related to the durability framework)