Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Add RetryableListener in opensearch core #13157

Closed
zane-neo opened this issue Apr 11, 2024 · 6 comments
Closed

[Feature Request] Add RetryableListener in opensearch core #13157

zane-neo opened this issue Apr 11, 2024 · 6 comments
Labels
enhancement Enhancement or improvement to existing feature or request Other

Comments

@zane-neo
Copy link
Contributor

Is your feature request related to a problem? Please describe

Currently there're cases that using NodeClient to dispatch requests to other nodes to execute. But it's possible that the node been dispatched crash during processing the request, and user will get exception. In a large cluster, such random node failure is pretty normal and the exception can be a considerable inconvenient for user.
In opensearch core, we have RetryableAction and RetryableTransportClient but the Action is designed to an abstract class so many cases a transportAction can not extend it if it has to extends another class. The RetryListener seems to have a dedicated purpose and not generic.

Example
Model inference is a core functionality of ml-commons plugin, since it's resource consuming so not all nodes will be serving models, usually we use ml type nodes to run models, when a non-ml node received an inference request, it will dispatch the request to a ml node, but any node can crash at at time, so the coordinator node can get a NodeNotConnectedException or NodeDisconnectedException exception.
Without retry, the exception will be encapsulated and return to user(usually a 500 error), this can be improved easily by adding retry mechanism, the coordinator node can retry the request by sending it to another node thus user can get expected results.

Describe the solution you'd like

We can create a new RetryableListener in opensearch core, and when retry is necessary we can use this actionListener directly or override some methods in it, e.g. shouldRetry and retryFunction so the retry can be performed by leveraging these methods.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

@zane-neo zane-neo added enhancement Enhancement or improvement to existing feature or request untriaged labels Apr 11, 2024
@github-actions github-actions bot added the Other label Apr 11, 2024
@peternied
Copy link
Member

In a large cluster, such random node failure is pretty normal and the exception can be a considerable inconvenient for user.

@zane-neo This statement gives me great concern, what do you mean node failure can you reference issues where these are occurring? I do believe that OpenSearch nodes should be robust and handle failures gracefully. Retry logic can be useful when failures are expected - such as TCP making IP communication robust [1]. However, nodes drop/crashes should be exceptional.

When a node is dropped or crashes the cluster is in a fragile state and retries can accelerate destabilization. Retries work when the resource being retried against is distributed so the request can be routed to a healthy alternative. There has been a much thought into how and when to implement retries I'd advocate reading this article and seeing how your scenario lines up [2].

@zane-neo
Copy link
Contributor Author

zane-neo commented Apr 12, 2024

I do believe that OpenSearch nodes should be robust and handle failures gracefully.

@peternied Agree on that but what I meant is a high level statistics, we usually measure the service resilience by nine numbers: https://en.wikipedia.org/wiki/High_availability#Percentage_calculation. From the link let's say a service guarantee 4 nines, it still has 8.64 seconds downtime everyday in average, so retry can help on this case.

Retry logic can be useful when failures are expected - such as TCP making IP communication robust.

I think it's a different scenario, TCP assume the hardware failure is temporary and retry can work since there must be alternative routes to target. HTTP can not make this assumption since not all services are distributed so we don't have a default retry mechanism in HTTP, that's why we need to implement this manually. There're quite a lot retry framework out there and here are some of them: https://engineering.ripple.com/selecting-retry-frameworks-for-your-java-project/.

When a node is dropped or crashes the cluster is in a fragile state and retries can accelerate destabilization.

True, so retry should only work in some cases, e.g. NodeNotConnectedException and NodeDisconnectedException but not CircuitBreakerException or OutOfMemoryException. Mainly retry is composed in two part: retry condition and retry action, I'm suggesting we provide a template listener in OpenSearch core and the default retry condition only include those exceptions should be retried, and user can reuse this by overriding some method of this listener to satisfy their use cases.

@shwetathareja
Copy link
Member

shwetathareja commented Apr 18, 2024

Thanks @zane-neo for creating the issue.

so the coordinator node can get a NodeNotConnectedException or NodeDisconnectedException exception.

In case the target node is throwing these exceptions, it might just be so that they are removed from cluster as well. So, in that case simply retrying may not even help.

It is still up to API implementation to handle specific transport errors and perform retries which is acceptable.

The RetryListener seems to have a dedicated purpose and not generic.

This is used specifically for reindex code path.

In opensearch core, we have RetryableAction

There is RetryingListener which is private to RetryableAction as the abstract class simplifies creating a retryable runnable action. Is that what you are looking for? Can you share the ml-commons API - transport API reference where you want to add this retry logic?

@zane-neo
Copy link
Contributor Author

In case the target node is throwing these exceptions, it might just be so that they are removed from cluster as well. So, in that case simply retrying may not even help.

Make sense, retrying with sending request to a dropped node doesn't help, but in a different scenario client side might able to choose another node to retry the request.

Is that what you are looking for? Can you share the ml-commons API - transport API reference where you want to add this retry logic?

It's very similar to what I'm looking for but it's a private class. I have this commit in ml-commons API to retry the predict API: zane-neo/ml-commons@ad1f2ee

@peternied
Copy link
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@zane-neo Thanks for creating this issue

@zane-neo
Copy link
Contributor Author

zane-neo commented Aug 5, 2024

It's fine to not add a new listener since using new RetryableAction is also an option to achieve the target of this issue, e.g.

public class MyTransportAction {

  @override 
  public void execute(Request req, ActionListener listener) {
    //pre client call logic
    request = buildRequest(req);
    //clientCall(request, listener); //instead of using this direct client call, use retryableClientCall.
    retryableClientCall(request, listener);
  }
  
  private void clientCall(ClientRequest request, ActionListener listener) {
    client.xxx(request, listener);
  }

  private retryableClientCall(request, listener) {
    RetryableAction action = new RetryableAction(request, listener, backoffPolicy, executor, xxx) {
      // override necessary methods.
    }
    action.run();
  }
}

@zane-neo zane-neo closed this as completed Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Other
Projects
None yet
Development

No branches or pull requests

3 participants