Stream read pool and default s3 timeouts tuning #10912

@vikasvb90 vikasvb90 commented Oct 25, 2023

Description

When uploading large shards (either a few large segments or many smaller segments), we observed S3 errors. Large segments hit API call timeouts, while large shards made up of many small segments hit connection acquisition timeouts, as shown below. In addition, on small instances with few CPU cores the stream read pool has low capacity, so upload requests are processed slowly.
This change increases the capacity of the stream read pool and raises the connection acquisition and API timeouts.

[2023-10-25T07:47:37,139][ERROR][o.o.i.s.RemoteDirectory  ] [51844e1377c6a637328871dc96b00d5e] Failed to upload blob _3h_Lucene90_0.tim
software.amazon.awssdk.core.exception.SdkClientException: Failed to send multipart upload requests.
        at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
        at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
        at org.opensearch.repositories.s3.async.AsyncTransferManager.handleException(AsyncTransferManager.java:278)
        at org.opensearch.repositories.s3.async.AsyncTransferManager.lambda$handleExceptionOrResponse$14(AsyncTransferManager.java:227)
        at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
        at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
        at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:79)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
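
To illustrate why a shard made of many small segments can hit connection acquisition timeouts, here is a toy model of a bounded connection pool. It is only a sketch under stated assumptions: `ConnectionLeaseModel`, its method names, and the numbers used are hypothetical and are not the AWS SDK's or the plugin's actual implementation. Each upload part must lease a connection within a deadline, and parts that wait longer than the deadline fail.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy model of a bounded connection pool (hypothetical, not the AWS SDK pool). */
final class ConnectionLeaseModel {
    private final Semaphore permits;          // one permit per available connection
    private final long acquireTimeoutMillis;  // how long a request may wait for a connection

    ConnectionLeaseModel(int maxConnections, long acquireTimeoutMillis) {
        this.permits = new Semaphore(maxConnections, true);
        this.acquireTimeoutMillis = acquireTimeoutMillis;
    }

    /** Returns true if a connection was leased before the acquisition timeout expired. */
    boolean tryLease() {
        try {
            return permits.tryAcquire(acquireTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Returns a leased connection to the pool. */
    void release() {
        permits.release();
    }

    public static void main(String[] args) {
        // Two connections, five parts that never release: the extra parts time out.
        ConnectionLeaseModel pool = new ConnectionLeaseModel(2, 50);
        int leased = 0;
        int timedOut = 0;
        for (int part = 0; part < 5; part++) {
            if (pool.tryLease()) {
                leased++;
            } else {
                timedOut++;
            }
        }
        System.out.println("leased=" + leased + " timedOut=" + timedOut);
    }
}
```

In this model, raising either the connection count or the acquisition timeout reduces failures during a burst, which is the trade-off this change adjusts.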

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions

Gradle Check (Jenkins) Run Completed with:

github-actions bot commented Oct 25, 2023

Compatibility status:

Checks if related components are compatible with change 6ed696f

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git]

@vikasvb90 vikasvb90 force-pushed the s3_timeouts_and_stream_read_pool_tuning branch from f0a373e to 17013ad Compare October 25, 2023 11:33

kkmr commented Nov 14, 2023

Also, do we have any perf results showing how things have improved?

vikasvb90 commented Nov 15, 2023

> Also, do we have any perf results showing how things have improved?

This is not about perf. It is about handling upload bursts caused by large file uploads. Increasing the timeouts is a temporary measure until we add a separate, dedicated slow client that handles such special cases by placing them behind a producer-consumer queue and processing them slowly, without impacting critical uploads.

For large uploads on a smaller instance, these timeouts may still be insufficient.
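
The producer-consumer idea mentioned in this comment could look roughly like the following. This is a minimal sketch, assuming a single consumer thread and a bounded queue; `SlowUploadQueue` and its methods are hypothetical names and do not exist in OpenSearch. Producers block when the queue is full, which applies back-pressure to large uploads instead of letting them compete with critical ones.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

/** Hypothetical bounded queue that processes slow uploads one at a time. */
final class SlowUploadQueue implements AutoCloseable {
    private final BlockingQueue<Runnable> pending;
    private final Thread consumer;

    SlowUploadQueue(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
        this.consumer = new Thread(this::drain, "slow-upload-consumer");
        this.consumer.setDaemon(true);
        this.consumer.start();
    }

    /** Producer side: blocks while the queue is full, then returns a future for the upload. */
    CompletableFuture<Void> submit(Runnable upload) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        Runnable task = () -> {
            try {
                upload.run();
                done.complete(null);
            } catch (Throwable t) {
                done.completeExceptionally(t);
            }
        };
        try {
            pending.put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            done.completeExceptionally(e);
        }
        return done;
    }

    /** Consumer side: runs queued uploads in submission order until interrupted. */
    private void drain() {
        try {
            while (true) {
                pending.take().run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void close() {
        consumer.interrupt();
    }

    public static void main(String[] args) {
        try (SlowUploadQueue queue = new SlowUploadQueue(2)) {
            CompletableFuture<Void> f = queue.submit(() -> System.out.println("uploading large blob"));
            f.join(); // wait for the slow upload to finish
        }
    }
}
```

A real slow client would also need its own connection pool and timeouts; the queue only controls how many slow uploads are in flight.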

❌ Gradle check result for 3ba52e3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vikasvb90 vikasvb90 added the v2.12.0 Issues and PRs related to version 2.12.0 label Nov 25, 2023
@vikasvb90 vikasvb90 force-pushed the s3_timeouts_and_stream_read_pool_tuning branch from 3ba52e3 to 6ed696f Compare November 26, 2023 13:17

✅ Gradle check result for 6ed696f: SUCCESS

codecov bot commented Nov 26, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (5bb6cae) 71.21% compared to head (6ed696f) 71.36%.

Files                                                   Patch %   Lines
...opensearch/repositories/s3/S3RepositoryPlugin.java   0.00%     11 Missing ⚠️
...ensearch/repositories/s3/StatsMetricPublisher.java   66.66%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10912      +/-   ##
============================================
+ Coverage     71.21%   71.36%   +0.14%     
- Complexity    58926    59002      +76     
============================================
  Files          4890     4890              
  Lines        277434   277447      +13     
  Branches      40313    40313              
============================================
+ Hits         197567   197989     +422     
+ Misses        63464    62992     -472     
- Partials      16403    16466      +63     


@gbbafna gbbafna merged commit 74b2d7d into opensearch-project:main Nov 26, 2023
28 of 29 checks passed
@vikasvb90 vikasvb90 added backport PRs or issues specific to backporting features or enhancments backport 2.x Backport to 2.x branch and removed backport PRs or issues specific to backporting features or enhancments labels Dec 3, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 3, 2023
Signed-off-by: vikasvb90 <[email protected]>
(cherry picked from commit 74b2d7d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Dec 4, 2023
(cherry picked from commit 74b2d7d)

Signed-off-by: vikasvb90 <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@@ -198,14 +198,14 @@ final class S3ClientSettings {
     static final Setting.AffixSetting<Integer> MAX_CONNECTIONS_SETTING = Setting.affixKeySetting(
         PREFIX,
         "max_connections",
-        key -> Setting.intSetting(key, 100, Property.NodeScope)
+        key -> Setting.intSetting(key, 500, Property.NodeScope)

Collaborator

I feel these should be made a function of the core count. For example, on instances with fewer cores, too many connections will exceed even the higher read timeouts, since there are only a limited number of threads to process them.

cc: @vikasvb90

Contributor Author

Yes, it should be, but we first have to fix the blocking disk reads happening in the stream reader pool, and then tune all the timeouts as well as this connection count based on benchmarks.
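
The suggestion in this thread, deriving the connection count from the number of cores, might be sketched as below. This is only an illustration: `CoreScaledConnections`, the per-core factor of 25, and the use of the old and new defaults (100 and 500) as lower and upper bounds are all assumptions, not values from the PR or from any benchmark.

```java
/** Hypothetical helper that scales max connections with the core count. */
final class CoreScaledConnections {
    static final int CONNECTIONS_PER_CORE = 25; // assumed factor, would need benchmarking
    static final int MIN_CONNECTIONS = 100;     // assumed floor (the previous default)
    static final int MAX_CONNECTIONS = 500;     // assumed ceiling (the new default)

    /** Clamps cores * CONNECTIONS_PER_CORE into [MIN_CONNECTIONS, MAX_CONNECTIONS]. */
    static int maxConnections(int cores) {
        if (cores < 1) {
            throw new IllegalArgumentException("cores must be >= 1");
        }
        int scaled = cores * CONNECTIONS_PER_CORE;
        return Math.max(MIN_CONNECTIONS, Math.min(MAX_CONNECTIONS, scaled));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores=" + cores + " maxConnections=" + maxConnections(cores));
    }
}
```

A clamp keeps small instances from opening more connections than their threads can service while still letting large instances use the higher limit.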

fahadshamiinsta pushed a commit to fahadshamiinsta/OpenSearch270 that referenced this pull request Dec 4, 2023
deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request Dec 11, 2023
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Labels
backport 2.x Backport to 2.x branch skip-changelog v2.12.0 Issues and PRs related to version 2.12.0
4 participants