Retry bulk request to OpenSearch #572

Merged: 5 commits on Aug 22, 2024

@ykmr1224 ykmr1224 commented Aug 16, 2024

Description

  • Retry bulk requests to OpenSearch.
    • Only the failed, retryable items in the batch are retried.
  • A retry already exists for other requests, but it is not applied to the bulk API, because the bulk request itself returns 200 even when individual items in it were throttled.
  • This mitigates throttling when writing an index to OpenSearch. When the NONE refresh policy is used, the bulk request gets a quick response (even when the server is overloaded), which leads to throttling.
  • Add rate limiter for bulk request #567 added a rate limit, but we still need retry for cases where the server is overloaded by other requests for a short period of time.
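
To illustrate the "retry only failed and retryable items" idea, here is a minimal sketch in plain Java. It does not use the OpenSearch client: `ItemResult`, `RETRYABLE`, and `nextBatch` are hypothetical names, and the set of retryable status codes is an assumption for illustration, not the list used in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class BulkRetrySketch {
  // Hypothetical per-item outcome: the status reported for one item of a bulk call.
  static final class ItemResult {
    final String docId;
    final int status;

    ItemResult(String docId, int status) {
      this.docId = docId;
      this.status = status;
    }
  }

  // Assumed retryable statuses, for illustration only.
  static final Set<Integer> RETRYABLE = Set.of(429, 502, 503, 504);

  // Collects the ids of items that failed with a retryable status.
  // Successful items and non-retryable failures are left out of the next batch.
  static List<String> nextBatch(List<ItemResult> results) {
    List<String> retry = new ArrayList<>();
    for (ItemResult r : results) {
      if (RETRYABLE.contains(r.status)) {
        retry.add(r.docId);
      }
    }
    return retry;
  }

  public static void main(String[] args) {
    List<ItemResult> results = List.of(
        new ItemResult("a", 201),   // created: done
        new ItemResult("b", 429),   // throttled: retry
        new ItemResult("c", 400));  // bad request: do not retry
    System.out.println(nextBatch(results));
  }
}
```

The point of the sketch is that the overall call can succeed while individual items still need to be resubmitted, so the decision has to be made per item.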

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Tomoyuki Morita <[email protected]>
Signed-off-by: Tomoyuki Morita <[email protected]>
@ykmr1224 ykmr1224 marked this pull request as ready for review August 16, 2024 20:54
.with(retryPolicy)
.get(() -> {
BulkResponse response = client.bulk(nextRequest.get(), options);
if (retryPolicy.getConfig().allowsRetries() && bulkItemErrorResultPredicate.test(
Collaborator

What is retryPolicy.getConfig().allowsRetries()? Is it configurable?

Collaborator Author

Yes, this comes from the existing config retry.max_retries. When it is set to 0, retry is disabled and allowsRetries() returns false.

Collaborator

nit: do we need to manage max_retries manually? Doesn't the RetryPolicy already handle it automatically?

Collaborator Author

This logic checks whether retry is enabled so that it does not generate the next retryable request when retry is disabled.
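
A rough sketch of that guard, in plain Java with no retry library: `RetryGateSketch`, `allowsRetries`, and `nextRequest` are hypothetical names, and "retries are allowed when max retries is not 0" is a simplifying assumption based on the explanation above, not the library's exact rule.

```java
import java.util.List;
import java.util.Optional;

final class RetryGateSketch {
  // Assumption for illustration: max_retries = 0 disables retry.
  static boolean allowsRetries(int maxRetries) {
    return maxRetries != 0;
  }

  // Builds the follow-up request only when retry is enabled and something failed.
  // Returning empty avoids constructing a request that would never be sent.
  static Optional<List<String>> nextRequest(int maxRetries, List<String> failedIds) {
    if (!allowsRetries(maxRetries) || failedIds.isEmpty()) {
      return Optional.empty();
    }
    return Optional.of(List.copyOf(failedIds));
  }

  public static void main(String[] args) {
    System.out.println(nextRequest(0, List.of("b")).isPresent()); // retry disabled
    System.out.println(nextRequest(3, List.of("b")).isPresent()); // retry enabled
  }
}
```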

BulkItemResponse[] bulkItemResponses = response.getItems();
BulkRequest nextRequest = new BulkRequest()
.setRefreshPolicy(request.getRefreshPolicy());
nextRequest.setParentTask(request.getParentTask());
Collaborator

what is parent task?

Collaborator Author

It indicates the parent task associated with this request. I could not find a good description in the OpenSearch docs. It appears to work like a tag for requests when checked from the _tasks API (we can filter tasks by parent task ID).
I copy the value from the original request so that it stays the same.

Collaborator

I don't get it. Tasks are an OpenSearch-internal concept; why does the bulk request need to attach task info?

Collaborator Author

We don't care about the task info, but since it is an attribute of BulkRequest, we just inherit the value so that the retry request is consistent with the original request (inheriting as much as possible from the original request).
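
The "inherit as much as possible" approach can be sketched as follows. `Request` here is a hypothetical stand-in, not the OpenSearch `BulkRequest` class, and its two string fields only model the refresh policy and parent task discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

final class InheritRequestSketch {
  // Hypothetical stand-in for a bulk request with request-level attributes.
  static final class Request {
    final String refreshPolicy;
    final String parentTask;
    final List<String> items = new ArrayList<>();

    Request(String refreshPolicy, String parentTask) {
      this.refreshPolicy = refreshPolicy;
      this.parentTask = parentTask;
    }
  }

  // The retry request copies request-level attributes from the original
  // and carries only the items that still need to be sent.
  static Request retryRequestFor(Request original, List<String> failedItems) {
    Request next = new Request(original.refreshPolicy, original.parentTask);
    next.items.addAll(failedItems);
    return next;
  }

  public static void main(String[] args) {
    Request original = new Request("NONE", "node1:42");
    original.items.addAll(List.of("a", "b", "c"));
    Request next = retryRequestFor(original, List.of("b"));
    System.out.println(next.refreshPolicy + " " + next.parentTask + " " + next.items);
  }
}
```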

return false;
}

private boolean isCreateConflict(BulkItemResponse itemResp) {
Collaborator

Is there no 429 exception?

Member

@vamsimanohar vamsimanohar Aug 16, 2024

Can you rename the method? isCreateConflict is odd. Does this mean the request is a create and resulted in a conflict? Is the intention to only retry requests with a conflict failure?

Collaborator Author

Here, anything other than a Conflict response to a Create request is considered retryable.

Collaborator

itemResp.getOpType() == DocWriteRequest.OpType.CREATE && (itemResp.getFailure() == null
        || itemResp.getFailure().getStatus() == RestStatus.CONFLICT);
  }

Does conflict mean throttled?

Collaborator Author

Here, CONFLICT means HTTP status 409 Conflict, which indicates that the same request reached the same document at the same time, so we shouldn't retry. This logic comes from the original implementation, which used it to check whether the bulk request succeeded. itemResp.getFailure() == null is not needed here, and I'll fix it.

Collaborator Author

Fixed.
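
For readers following this thread, the check being discussed can be sketched in plain Java as below. `OpType` and the integer status are hypothetical stand-ins for the client's types, and the explicit null handling is an assumption of this sketch; the merged code may differ.

```java
final class CreateConflictSketch {
  // Hypothetical stand-in for the operation type of a bulk item.
  enum OpType { CREATE, INDEX, UPDATE, DELETE }

  static final int CONFLICT = 409;

  // True only for a create operation that failed with 409 Conflict.
  // failureStatus is null when the item did not fail.
  static boolean isCreateConflict(OpType opType, Integer failureStatus) {
    return opType == OpType.CREATE
        && failureStatus != null
        && failureStatus == CONFLICT;
  }

  public static void main(String[] args) {
    System.out.println(isCreateConflict(OpType.CREATE, 409)); // conflict: not retried
    System.out.println(isCreateConflict(OpType.CREATE, 429)); // throttled: not a conflict
  }
}
```

A create conflict is excluded from retry because resending it would produce the same 409; other failures remain candidates for retry.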

@vamsimanohar
Member

If I understood correctly, whenever there is a failure in bulk we will retry with exponential backoff. What was the retry policy earlier?

Why do we need a separate backoff strategy apart from the rate limiter?

Can you add parts of your design document to the PR description, so open-source users understand the change?

Signed-off-by: Tomoyuki Morita <[email protected]>
@ykmr1224
Collaborator Author

> If I understood correctly, whenever there is a failure in bulk we will retry with exponential backoff. What was the retry policy earlier?

Originally, the retry policy took effect only when the whole request failed. It was not applied when the bulk request itself returned 200 but individual items in it failed.

> Why do we need a separate backoff strategy apart from the rate limiter?
>
> Can you add parts of your design document to the PR description, so open-source users understand the change?

I put some description in the PR; which part is missing or unclear?
I think retry is needed anyway for robustness. (Even if the rate limit is well implemented, there could be throttling or other temporary issues from time to time.)
The rate limit might need some improvement in the longer term.
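
As a rough illustration of how item-level retry with exponential backoff can sit alongside a rate limiter, here is a sketch in plain Java. All names (`BackoffRetrySketch`, `delayMillis`, `sendWithRetry`) and the delay values are hypothetical; the PR itself relies on a retry policy object rather than a hand-written loop.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.LongConsumer;

final class BackoffRetrySketch {
  // Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped at maxMillis.
  static long delayMillis(int attempt, long baseMillis, long maxMillis) {
    long delay = baseMillis << Math.min(attempt - 1, 20); // bound the shift to avoid overflow
    return Math.min(delay, maxMillis);
  }

  // `send` submits a batch and returns the items that failed.
  // Each round waits, then resubmits only the items that are still failing.
  // Returns whatever is still failing once retries are exhausted.
  static List<String> sendWithRetry(List<String> items, int maxRetries,
      Function<List<String>, List<String>> send, LongConsumer pause) {
    List<String> pending = send.apply(items);
    for (int attempt = 1; attempt <= maxRetries && !pending.isEmpty(); attempt++) {
      pause.accept(delayMillis(attempt, 100, 1_000));
      pending = send.apply(pending);
    }
    return pending;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fake server: item "b" is throttled on the first two calls, then accepted.
    List<String> left = sendWithRetry(List.of("a", "b"), 3,
        batch -> ++calls[0] < 3 ? List.of("b") : List.<String>of(),
        millis -> System.out.println("would wait " + millis + " ms"));
    System.out.println("still failing: " + left + ", calls: " + calls[0]);
  }
}
```

The pause is injected as a `LongConsumer` so the loop can be exercised without actually sleeping; production code would block or schedule instead.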

Signed-off-by: Tomoyuki Morita <[email protected]>
Signed-off-by: Tomoyuki Morita <[email protected]>
@vamsimanohar vamsimanohar merged commit 3db16ec into opensearch-project:main Aug 22, 2024
4 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 22, 2024
* Add retry to bulk request

Signed-off-by: Tomoyuki Morita <[email protected]>

* Retry only failed items

Signed-off-by: Tomoyuki Morita <[email protected]>

* Address comments

Signed-off-by: Tomoyuki Morita <[email protected]>

* Fix isCreateConflict

Signed-off-by: Tomoyuki Morita <[email protected]>

* Add and fix unit tests

Signed-off-by: Tomoyuki Morita <[email protected]>

---------

Signed-off-by: Tomoyuki Morita <[email protected]>
(cherry picked from commit 3db16ec)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
noCharger pushed a commit that referenced this pull request Aug 24, 2024
* Add retry to bulk request

* Retry only failed items

* Address comments

* Fix isCreateConflict

* Add and fix unit tests

---------

(cherry picked from commit 3db16ec)

Signed-off-by: Tomoyuki Morita <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
3 participants