[Feature Request] Add RetryableListener in opensearch core #13157

zane-neo · 2024-04-11T12:00:46Z

Is your feature request related to a problem? Please describe

Currently, there are cases where NodeClient is used to dispatch requests to other nodes for execution. The target node can crash while processing the request, in which case the user gets an exception. In a large cluster, such random node failures are fairly normal, and the exception can be a considerable inconvenience for users.
OpenSearch core already has RetryableAction and RetryableTransportClient, but RetryableAction is designed as an abstract class, so in many cases a transport action cannot extend it because it already has to extend another class. RetryListener seems to serve a dedicated purpose and is not generic.

Example
Model inference is a core functionality of the ml-commons plugin. Because it is resource-intensive, not all nodes serve models; usually ML-type nodes run them. When a non-ML node receives an inference request, it dispatches the request to an ML node. Any node can crash at any time, so the coordinator node can get a NodeNotConnectedException or NodeDisconnectedException.
Without retry, the exception is wrapped and returned to the user (usually as a 500 error). This can be improved easily by adding a retry mechanism: the coordinator node retries the request by sending it to another node, so the user still gets the expected result.

Describe the solution you'd like

We can create a new RetryableListener in OpenSearch core. When retry is necessary, callers can use this ActionListener directly or override some of its methods, e.g. shouldRetry and retryFunction, so that the retry is performed through these methods.
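
A minimal sketch of what such a listener could look like. It is written against a small stand-in `Listener` interface rather than the real OpenSearch `ActionListener` so that it runs on its own. `shouldRetry` and `retryFunction` follow the names in the proposal; `maxRetries`, the constructor shape, and the use of `java.net.ConnectException` as the default retryable failure are assumptions for illustration, not existing core API.

```java
import java.util.function.Consumer;

// Stand-in for OpenSearch's ActionListener (assumption: simplified for illustration).
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Hypothetical template listener: passes successes through and retries selected failures.
// Not thread-safe; a real implementation would need to guard the attempt counter.
class RetryableListener<T> implements Listener<T> {
    private final Listener<T> delegate;                 // the caller's original listener
    private final Consumer<Listener<T>> retryFunction;  // re-issues the request with the given listener
    private final int maxRetries;                       // hypothetical upper bound on retries
    private int attempts = 0;

    RetryableListener(Listener<T> delegate, Consumer<Listener<T>> retryFunction, int maxRetries) {
        this.delegate = delegate;
        this.retryFunction = retryFunction;
        this.maxRetries = maxRetries;
    }

    // Retry condition. Override to change which failures are retried.
    // ConnectException stands in here for a "node not reachable" failure.
    protected boolean shouldRetry(Exception e) {
        return e instanceof java.net.ConnectException;
    }

    @Override
    public void onResponse(T response) {
        delegate.onResponse(response);
    }

    @Override
    public void onFailure(Exception e) {
        if (attempts < maxRetries && shouldRetry(e)) {
            attempts++;
            retryFunction.accept(this); // retry action: send again, reporting back to this listener
        } else {
            delegate.onFailure(e);      // not retryable or budget exhausted: surface the failure
        }
    }
}
```

A caller would wrap its own listener in a `RetryableListener`, pass the same function it uses to send the request as `retryFunction`, and start the first attempt by invoking that function with the wrapper.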

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response


peternied · 2024-04-11T19:19:55Z

In a large cluster, such random node failure is pretty normal and the exception can be a considerable inconvenient for user.

@zane-neo This statement gives me great concern. What do you mean by node failure? Can you reference issues where these are occurring? I do believe that OpenSearch nodes should be robust and handle failures gracefully. Retry logic can be useful when failures are expected, such as TCP making IP communication robust [1]. However, node drops/crashes should be exceptional.

When a node is dropped or crashes, the cluster is in a fragile state and retries can accelerate destabilization. Retries work when the resource being retried against is distributed, so the request can be routed to a healthy alternative. A lot of thought has gone into how and when to implement retries; I'd advocate reading this article and seeing how your scenario lines up [2].

[1] Rabbit hole if you are interested in TCP/IP: https://stackoverflow.com/a/12956789/533057
[2] https://codecurated.com/blog/designing-a-retry-mechanism-for-reliable-systems/

zane-neo · 2024-04-12T01:29:11Z

I do believe that OpenSearch nodes should be robust and handle failures gracefully.

@peternied Agreed, but what I meant is a high-level statistic. We usually measure service resilience in "nines": https://en.wikipedia.org/wiki/High_availability#Percentage_calculation. Per that link, even a service that guarantees four nines still has 8.64 seconds of downtime per day on average, so retry can help in this case.
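
The arithmetic behind that figure can be checked with a few lines. This is only a worked calculation; the class and method names are made up for illustration.

```java
// Hypothetical helper: converts an availability fraction into expected downtime.
class Availability {
    // Expected downtime per day, in seconds, for an availability such as 0.9999 (four nines).
    static double downtimeSecondsPerDay(double availability) {
        double secondsPerDay = 24 * 60 * 60; // 86,400 seconds
        return (1.0 - availability) * secondsPerDay;
    }
}
```

`Availability.downtimeSecondsPerDay(0.9999)` comes out at roughly 8.64 seconds (up to floating-point rounding), matching the number quoted from the table.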

Retry logic can be useful when failures are expected - such as TCP making IP communication robust.

I think it's a different scenario. TCP assumes the hardware failure is temporary, and retry can work since there must be alternative routes to the target. HTTP cannot make this assumption since not all services are distributed, so HTTP has no default retry mechanism; that's why we need to implement this manually. There are quite a lot of retry frameworks out there; here are some of them: https://engineering.ripple.com/selecting-retry-frameworks-for-your-java-project/.

When a node is dropped or crashes the cluster is in a fragile state and retries can accelerate destabilization.

True, so retry should only apply in some cases, e.g. NodeNotConnectedException and NodeDisconnectedException but not CircuitBreakingException or OutOfMemoryError. Retry mainly consists of two parts: a retry condition and a retry action. I'm suggesting we provide a template listener in OpenSearch core whose default retry condition includes only those exceptions that should be retried; users can reuse it by overriding some of its methods to fit their use cases.
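
The retry condition described here could be sketched as an allowlist check. In this sketch the two exception classes are local stand-ins for the OpenSearch transport exceptions of the same name, and `RetryCondition`, the cause-chain walk, and its depth limit are assumptions for illustration rather than anything that exists in core.

```java
import java.util.Set;

// Local stand-ins for the OpenSearch transport exceptions (assumption: the real ones live in core).
class NodeNotConnectedException extends RuntimeException {
    NodeNotConnectedException(String message) { super(message); }
}

class NodeDisconnectedException extends RuntimeException {
    NodeDisconnectedException(String message) { super(message); }
}

// Hypothetical default retry condition: retry only failures on an explicit allowlist.
class RetryCondition {
    private static final Set<Class<? extends Throwable>> RETRYABLE =
        Set.of(NodeNotConnectedException.class, NodeDisconnectedException.class);

    static boolean shouldRetry(Throwable failure) {
        Throwable t = failure;
        // Walk a bounded number of causes, since failures may arrive wrapped.
        for (int depth = 0; t != null && depth < 10; depth++) {
            if (t instanceof Error) {
                return false; // e.g. OutOfMemoryError: retrying would add load to a struggling node
            }
            if (RETRYABLE.contains(t.getClass())) {
                return true;  // exact class match only; subclasses are not considered
            }
            t = t.getCause();
        }
        return false;         // anything not on the allowlist is treated as non-retryable
    }
}
```

Defaulting to "do not retry" keeps resource-exhaustion failures out of the retry path unless someone opts them in on purpose.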

shwetathareja · 2024-04-18T07:31:34Z

Thanks @zane-neo for creating the issue.

so the coordinator node can get a NodeNotConnectedException or NodeDisconnectedException exception.

If the target node is throwing these exceptions, it may well have been removed from the cluster too. In that case, simply retrying may not even help.

It is still up to the API implementation to handle specific transport errors and perform retries, which is acceptable.

The RetryListener seems to have a dedicated purpose and not generic.

This is used specifically for reindex code path.

In opensearch core, we have RetryableAction

There is RetryingListener, which is private to RetryableAction, since the abstract class simplifies creating a retryable runnable action. Is that what you are looking for? Can you share the ml-commons API / transport API reference where you want to add this retry logic?

zane-neo · 2024-04-22T11:28:38Z

In case the target node is throwing these exceptions, it might just be so that they are removed from cluster as well. So, in that case simply retrying may not even help.

Makes sense: retrying by sending the request to a dropped node doesn't help. In a different scenario, though, the client side might be able to choose another node to retry the request.
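
A simplified sketch of that "choose another node" idea. It is synchronous and uses plain strings as node IDs, whereas the real dispatch path is asynchronous and listener-based; `FailoverDispatcher` and `dispatch` are hypothetical names. For brevity it fails over on any `RuntimeException`; in practice only connection-type failures would qualify.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: tries eligible nodes one after another instead of re-sending to the same node.
class FailoverDispatcher {
    // Returns the first successful result, or throws if every candidate node failed.
    static <R> R dispatch(List<String> nodeIds, Function<String, R> sendToNode) {
        RuntimeException last = null;
        for (String nodeId : nodeIds) {
            try {
                return sendToNode.apply(nodeId);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next candidate
            }
        }
        throw new IllegalStateException("no eligible node could serve the request", last);
    }
}
```

Because each attempt goes to a different node, a node that has left the cluster is never retried against, which addresses the concern raised above.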

Is that what you are looking for? Can you share the ml-commons API - transport API reference where you want to add this retry logic?

It's very similar to what I'm looking for but it's a private class. I have this commit in ml-commons API to retry the predict API: zane-neo/ml-commons@ad1f2ee

peternied · 2024-05-01T15:50:14Z

[Triage - attendees 1 2 3 4 5 6 7 8]
@zane-neo Thanks for creating this issue

zane-neo · 2024-08-05T07:51:45Z

It's fine not to add a new listener, since creating a new RetryableAction is also an option for achieving the goal of this issue, e.g.

public class MyTransportAction {

  @Override
  public void execute(Request req, ActionListener listener) {
    // Pre-client-call logic.
    ClientRequest request = buildRequest(req);
    // Instead of the direct call clientCall(request, listener), go through the retrying wrapper.
    retryableClientCall(request, listener);
  }

  private void clientCall(ClientRequest request, ActionListener listener) {
    client.xxx(request, listener);
  }

  private void retryableClientCall(ClientRequest request, ActionListener listener) {
    RetryableAction action = new RetryableAction(request, listener, backoffPolicy, executor, xxx) {
      // Override the necessary methods, e.g. tryAction (to invoke clientCall) and shouldRetry.
    };
    action.run();
  }
}

zane-neo added the enhancement and untriaged labels Apr 11, 2024

github-actions bot added the Other label Apr 11, 2024

zane-neo mentioned this issue Apr 11, 2024

[Feature Request] Add RetryableListener in opensearch core #13132

Closed

zane-neo mentioned this issue Apr 16, 2024

[FEATURE] Add retry mechanism so predict API can success in node crash case opensearch-project/ml-commons#2327

Closed

peternied removed the untriaged label May 1, 2024

zane-neo mentioned this issue May 13, 2024

[RFC] Add retry with backoff for SageMaker throttling exception to mitigate the data lost problem opensearch-project/ml-commons#2438

Closed

zane-neo closed this as completed Aug 5, 2024
