Retry bulk request to OpenSearch #572

ykmr1224 · 2024-08-16T19:53:03Z

Description

Retry bulk requests to OpenSearch.
- Only the failed, retryable items in the batch are retried.
Retry already exists for other requests, but it is not applied to the bulk API, because a bulk request itself returns 200 even when individual items in it were throttled.
This mitigates throttling when writing index data to OpenSearch. When the NONE refresh policy is used, the bulk request is answered quickly (even when the server is overloaded), which leads to throttling.
"Add rate limiter for bulk request" (#567) added a rate limit, but retry is still needed for cases where the server is overloaded by other requests for a short period of time.
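
To illustrate the "retry only failed and retryable items" idea, here is a minimal, self-contained sketch. It does not use the OpenSearch client: `ItemResult`, `bulkWithRetry`, the fake `send` function, and the choice of treating only status 429 as retryable are all assumptions made for illustration, and the sketch omits the backoff between attempts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for illustration; these are not OpenSearch client classes.
class BulkRetrySketch {

  /** Outcome of one item in a bulk response: the item and its HTTP-like status. */
  static final class ItemResult {
    final String item;
    final int status;
    ItemResult(String item, int status) { this.item = item; this.status = status; }
  }

  /** Assumption for this sketch: only 429 (too many requests) is retryable. */
  static boolean isRetryable(ItemResult r) {
    return r.status == 429;
  }

  /**
   * Sends the batch, then resends only the items that came back with a
   * retryable status, up to maxRetries extra times. Returns the items that
   * still had a retryable failure after the last attempt.
   */
  static List<String> bulkWithRetry(List<String> items,
                                    Function<List<String>, List<ItemResult>> send,
                                    int maxRetries) {
    List<String> pending = new ArrayList<>(items);
    for (int attempt = 0; attempt <= maxRetries && !pending.isEmpty(); attempt++) {
      List<String> next = new ArrayList<>();
      for (ItemResult r : send.apply(pending)) {
        if (isRetryable(r)) {
          next.add(r.item); // only this item goes into the next request
        }
      }
      pending = next;
    }
    return pending;
  }

  public static void main(String[] args) {
    // Fake server: throttles item "b" on the first call only.
    final int[] calls = {0};
    Function<List<String>, List<ItemResult>> fakeSend = batch -> {
      calls[0]++;
      List<ItemResult> out = new ArrayList<>();
      for (String item : batch) {
        boolean throttled = calls[0] == 1 && item.equals("b");
        out.add(new ItemResult(item, throttled ? 429 : 200));
      }
      return out;
    };
    List<String> stillFailing = bulkWithRetry(Arrays.asList("a", "b", "c"), fakeSend, 3);
    // Second call contains only "b"; prints: calls=2 stillFailing=[]
    System.out.println("calls=" + calls[0] + " stillFailing=" + stillFailing);
  }
}
```

With `maxRetries` set to 0 the loop runs exactly once, so the batch is sent a single time and nothing is resent.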

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Tomoyuki Morita <[email protected]>

flint-core/src/main/scala/org/opensearch/flint/core/storage/OpenSearchBulkRetryWrapper.java

penghuo · 2024-08-16T21:47:28Z

+          .with(retryPolicy)
+          .get(() -> {
+            BulkResponse response = client.bulk(nextRequest.get(), options);
+            if (retryPolicy.getConfig().allowsRetries() && bulkItemErrorResultPredicate.test(


What is retryPolicy.getConfig().allowsRetries()? Is it configurable?

Yes, it comes from the existing config retry.max_retries. When that is set to 0, retry is disabled and allowsRetries() returns false.

Nit: do we need to manage max_retries manually? Doesn't the RetryPolicy already handle it automatically?

This logic checks whether retry is enabled so that the next retryable request is not built when retry is disabled.

flint-core/src/main/scala/org/opensearch/flint/core/storage/OpenSearchBulkRetryWrapper.java

penghuo · 2024-08-16T21:56:47Z

+    BulkItemResponse[] bulkItemResponses = response.getItems();
+    BulkRequest nextRequest = new BulkRequest()
+        .setRefreshPolicy(request.getRefreshPolicy());
+    nextRequest.setParentTask(request.getParentTask());


What is the parent task?

It indicates the parent task associated with this request. I could not find a good description in the OpenSearch docs. It appears to work like a tag for requests when checking from the _tasks API (tasks can be filtered by parent task ID).
The same value is copied from the original request to keep it consistent.

I don't get it. Tasks are an OpenSearch-internal concept, so why does the bulk request need to attach task info?

We don't care about the task info, but since it is an attribute of BulkRequest, the retry request just inherits the value so that it stays consistent with the original request (inheriting as much as possible from the original request).

...core/src/main/scala/org/opensearch/flint/core/http/handler/BulkItemErrorResultPredicate.java

penghuo · 2024-08-16T22:01:02Z

+    return false;
+  }
+
+  private boolean isCreateConflict(BulkItemResponse itemResp) {


Is there no 429 exception?

Can you rename the method? isCreateConflict is odd. Does it mean the request is a create and resulted in a conflict? Is the intention to retry only requests with a conflict failure?

Here, any response other than Conflict for a Create request is considered retryable.

itemResp.getOpType() == DocWriteRequest.OpType.CREATE && (itemResp.getFailure() == null || itemResp.getFailure().getStatus() == RestStatus.CONFLICT); }

Does conflict mean throttled?

Here, CONFLICT means HTTP status 409 Conflict, which indicates that the same request reached the same document at the same time, so we shouldn't retry. This logic comes from the original implementation, which used it to decide whether the bulk request succeeded. The itemResp.getFailure() == null check is not needed here, and I'll fix it.
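
As a rough sketch of the rule discussed in this thread (a failed item is retryable unless it is a create that ended in 409 Conflict), the following uses hypothetical stand-in types rather than `BulkItemResponse`, `DocWriteRequest.OpType`, or `RestStatus`. The method name `isRetryable` and the use of a nullable `Integer` for "no failure" are assumptions for illustration; the predicate actually merged in the PR may check more conditions.

```java
// Hypothetical stand-ins for illustration; not the OpenSearch client types.
class RetryableItemSketch {

  enum OpType { CREATE, INDEX, UPDATE, DELETE }

  static final int CONFLICT = 409;

  /**
   * A create that failed with 409 Conflict. Resending it is not expected to
   * help, so it is excluded from retry. failureStatus is null on success.
   */
  static boolean isCreateConflict(OpType opType, Integer failureStatus) {
    return opType == OpType.CREATE
        && failureStatus != null
        && failureStatus == CONFLICT;
  }

  /** Failed items are retried unless they are a create conflict. */
  static boolean isRetryable(OpType opType, Integer failureStatus) {
    return failureStatus != null && !isCreateConflict(opType, failureStatus);
  }

  public static void main(String[] args) {
    System.out.println(isRetryable(OpType.CREATE, 409));  // false: create conflict
    System.out.println(isRetryable(OpType.CREATE, 429));  // true: throttled
    System.out.println(isRetryable(OpType.INDEX, null));  // false: item succeeded
  }
}
```

Note that under this rule a 409 on a non-create operation is still treated as retryable, which follows the wording in the thread rather than a verified reading of the merged code.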

flint-core/src/main/scala/org/opensearch/flint/core/http/FlintRetryOptions.java

vamsimanohar · 2024-08-16T23:21:34Z

If I understood correctly, whenever there is a failure in bulk we will retry with exponential backoff. What was the retry policy earlier?

Why do we need a separate backoff strategy apart from the rate limiter?

Can you add parts of your design document to the PR description, so open source users understand the change?

ykmr1224 · 2024-08-17T00:05:24Z

If I understood correctly, whenever there is a failure in bulk we will retry with exponential backoff. What was the retry policy earlier?

Originally, the retry policy took effect only when the whole request failed. It was not applied when the bulk request itself returned 200 but individual items in it failed.

Why do we need a separate backoff strategy apart from the rate limiter?

Can you add parts of your design document to the PR description, so open source users understand the change?

I put some description in the PR, but which part is missing or unclear?
I think retry is needed anyway for robustness. (Even if the rate limit is well implemented, there could be throttling or other temporary issues from time to time.)
The rate limit might need some improvement in the longer term.
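
For readers unfamiliar with the exponential backoff mentioned in the question above, here is a minimal sketch of how such delays are commonly computed. The names `delayFor`, `base`, and `max`, and the example values, are assumptions for illustration and are not taken from FlintRetryOptions or from the retry library used in the PR.

```java
import java.time.Duration;

// Illustrative only: a capped exponential backoff schedule.
class BackoffSketch {

  /**
   * Delay before retry number `attempt` (0-based): base * 2^attempt,
   * capped at max. Doubling stops once the cap is reached, which also
   * avoids overflow for large attempt numbers.
   */
  static Duration delayFor(int attempt, Duration base, Duration max) {
    long millis = base.toMillis();
    for (int i = 0; i < attempt && millis < max.toMillis(); i++) {
      millis *= 2;
    }
    return Duration.ofMillis(Math.min(millis, max.toMillis()));
  }

  public static void main(String[] args) {
    Duration base = Duration.ofMillis(100);
    Duration max = Duration.ofMillis(1000);
    StringBuilder sb = new StringBuilder();
    for (int attempt = 0; attempt <= 5; attempt++) {
      sb.append(delayFor(attempt, base, max).toMillis()).append(' ');
    }
    // Prints: 100 200 400 800 1000 1000
    System.out.println(sb.toString().trim());
  }
}
```

A rate limiter smooths the client's own send rate, while a schedule like this spaces out resends after the server has already rejected items, which is why the two can complement each other.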

* Add retry to bulk request
Signed-off-by: Tomoyuki Morita <[email protected]>
* Retry only failed items
Signed-off-by: Tomoyuki Morita <[email protected]>
* Address comments
Signed-off-by: Tomoyuki Morita <[email protected]>
* Fix isCreateConflict
Signed-off-by: Tomoyuki Morita <[email protected]>
* Add and fix unit tests
Signed-off-by: Tomoyuki Morita <[email protected]>
---------
Signed-off-by: Tomoyuki Morita <[email protected]>
(cherry picked from commit 3db16ec)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Add retry to bulk request
* Retry only failed items
* Address comments
* Fix isCreateConflict
* Add and fix unit tests
---------
(cherry picked from commit 3db16ec)
Signed-off-by: Tomoyuki Morita <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

ykmr1224 added 2 commits August 16, 2024 12:50

Add retry to bulk request

1b99821

Signed-off-by: Tomoyuki Morita <[email protected]>

Retry only failed items

cee1c20

Signed-off-by: Tomoyuki Morita <[email protected]>

ykmr1224 marked this pull request as ready for review August 16, 2024 20:54

ykmr1224 requested review from dai-chen, rupal-bq, vamsimanohar, penghuo, seankao-az, anirudha, kaituo and YANG-DB as code owners August 16, 2024 20:54

penghuo reviewed Aug 16, 2024

vamsimanohar reviewed Aug 16, 2024

flint-core/src/main/scala/org/opensearch/flint/core/http/FlintRetryOptions.java (resolved)

vamsimanohar closed this Aug 16, 2024

vamsimanohar reopened this Aug 16, 2024

Address comments

c06112f

Signed-off-by: Tomoyuki Morita <[email protected]>

ykmr1224 added 2 commits August 20, 2024 10:46

Fix isCreateConflict

f9b53b1

Signed-off-by: Tomoyuki Morita <[email protected]>

Add and fix unit tests

dcc021e

Signed-off-by: Tomoyuki Morita <[email protected]>

vamsimanohar approved these changes Aug 21, 2024

vamsimanohar merged commit 3db16ec into opensearch-project:main Aug 22, 2024
4 checks passed

vamsimanohar added the backport 0.5-nexus label Aug 22, 2024

vamsimanohar assigned ykmr1224 Aug 22, 2024

opensearch-trigger-bot bot mentioned this pull request Aug 22, 2024

[Backport 0.5-nexus] Retry bulk request to OpenSearch #594

Merged

ykmr1224 mentioned this pull request Sep 10, 2024

[FEATURE] Avoid throttling when writing data to index #640

Open
