[Feature Request] Add RetryableListener in opensearch core #13157
Comments
@zane-neo This statement gives me great concern. What do you mean by node failure? Can you reference issues where these are occurring? I do believe that OpenSearch nodes should be robust and handle failures gracefully. Retry logic can be useful when failures are expected, such as TCP making IP communication robust [1]. However, node drops and crashes should be exceptional. When a node is dropped or crashes, the cluster is in a fragile state and retries can accelerate destabilization. Retries work when the resource being retried against is distributed, so the request can be routed to a healthy alternative. A lot of thought has gone into how and when to implement retries; I'd advocate reading this article and seeing how your scenario lines up [2].
@peternied Agreed, but what I meant was high-level statistics; we usually measure the service resilience by
I think it's a different scenario: TCP assumes the hardware failure is temporary, and retry can work since there must be an alternative
True, so retry should only work in some cases, e.g.
Thanks @zane-neo for creating the issue.
If the target node is throwing these exceptions, it may well have been removed from the cluster too, in which case simply retrying may not help. It is still up to the API implementation to handle specific transport errors and perform retries, which is acceptable.
This is used specifically for the reindex code path.
There is RetryingListener, which is private to RetryableAction; the abstract class simplifies creating a retryable runnable action. Is that what you are looking for? Can you share the ml-commons API (transport API) reference where you want to add this retry logic?
Makes sense: retrying by sending the request to a dropped node doesn't help, but in a different scenario the client side might be able to choose another node to retry the request.
It's very similar to what I'm looking for, but it's a private class. I have this commit in ml-commons to retry the predict API: zane-neo/ml-commons@ad1f2ee
It's fine to not add a new listener since using
Is your feature request related to a problem? Please describe
Currently there are cases where NodeClient is used to dispatch requests to other nodes for execution. The node a request is dispatched to can crash while processing it, and the user then gets an exception. In a large cluster, such random node failures are fairly normal, and the exceptions can be a considerable inconvenience for users.
In opensearch core, we have RetryableAction and RetryableTransportClient, but RetryableAction is designed as an abstract class, so in many cases a transport action cannot extend it because it already has to extend another class. The RetryListener seems to have a dedicated purpose and is not generic.
Example
Model inference is a core functionality of the ml-commons plugin. Since it is resource-consuming, not all nodes serve models; usually we use ml type nodes to run them. When a non-ml node receives an inference request, it dispatches the request to an ml node. But any node can crash at any time, so the coordinator node can get a NodeNotConnectedException or NodeDisconnectedException. Without retry, the exception is wrapped and returned to the user (usually as a 500 error). This can be improved easily by adding a retry mechanism: the coordinator node can retry the request by sending it to another node, so the user gets the expected result.
Describe the solution you'd like
We can create a new RetryableListener in opensearch core. When retry is necessary, we can use this ActionListener directly or override some of its methods, e.g. shouldRetry and retryFunction, so that the retry is performed through these methods.
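A minimal sketch of what such a listener could look like, to make the proposal concrete. Everything here is an assumption for illustration: `Listener` is a local stand-in modeled on OpenSearch's ActionListener so that the example compiles on its own, `RetryableListener` does not exist in opensearch core, and the constructor parameters (`maxRetries`, `retryable`, `dispatch`) are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Local stand-in for an ActionListener-style callback (assumption, not the real class).
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Hypothetical RetryableListener: wraps the caller's listener and re-dispatches
// the request when a failure is considered retryable.
class RetryableListener<T> implements Listener<T> {
    private final Listener<T> delegate;            // the caller's original listener
    private final Predicate<Exception> retryable;  // which failures are worth retrying
    private final Consumer<Listener<T>> dispatch;  // re-sends the request, e.g. to another node
    private final AtomicInteger retriesLeft;       // hard upper bound on retries

    RetryableListener(Listener<T> delegate, int maxRetries,
                      Predicate<Exception> retryable, Consumer<Listener<T>> dispatch) {
        this.delegate = delegate;
        this.retryable = retryable;
        this.dispatch = dispatch;
        this.retriesLeft = new AtomicInteger(maxRetries);
    }

    // Subclasses may override this to apply their own retry policy.
    protected boolean shouldRetry(Exception e) {
        return retriesLeft.get() > 0 && retryable.test(e);
    }

    // Subclasses may override this to change how the request is re-sent.
    protected void retryFunction() {
        retriesLeft.decrementAndGet();
        dispatch.accept(this); // reuse this listener so later failures are also handled
    }

    @Override
    public void onResponse(T response) {
        delegate.onResponse(response);
    }

    @Override
    public void onFailure(Exception e) {
        if (shouldRetry(e)) {
            try {
                retryFunction();
            } catch (Exception retryError) {
                retryError.addSuppressed(e);
                delegate.onFailure(retryError); // the retry itself could not be started
            }
        } else {
            delegate.onFailure(e); // not retryable, or retries exhausted
        }
    }
}
```

In this sketch the caller would pass a predicate that matches only connection-level failures (for example NodeNotConnectedException or NodeDisconnectedException) and a `dispatch` function that picks a different ml node on each call. The retry count is bounded and there is no backoff; whether that is safe for a cluster that is already unstable is exactly the concern raised in the comments.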
Related component
Other
Describe alternatives you've considered
No response
Additional context
No response