[Feature Request] <Avoid sending requests to data nodes if they have already failed serving requests> #14757

viku-int · 2024-07-15T20:39:13Z

Is your feature request related to a problem? Please describe

Over the past twelve months, we have had a series of incidents caused by data node failures. The underlying problem is how long the load balancers take to detect a failed node: 150 seconds. Our standard client configuration uses a request timeout of 5 seconds or more and retries when a request times out or when OpenSearch returns a 5xx error. Even with multiple retries, some requests are still routed to nodes that have already failed to serve earlier requests.

Describe the solution you'd like

We would like the OpenSearch service to add an HTTP header that identifies failed data nodes by address, for example "failed_datanode: 11.183.176.86:443". Whenever this header is present, the ALB should stop sending requests to the data nodes it lists.
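To make the proposed behavior concrete, here is a minimal Python sketch of the routing logic, not an existing ALB or OpenSearch feature. It assumes the header is named "failed_datanode" as suggested above and carries one or more comma-separated "host:port" values. The class name `NodeAwareBalancer`, the 150-second cooldown, and the 10.0.0.x addresses are hypothetical and only illustrate the idea.

```python
import time
from typing import Dict, List, Optional

# Proposed header name from this request; OpenSearch does not send it today.
FAILED_NODE_HEADER = "failed_datanode"


class NodeAwareBalancer:
    """Hypothetical round-robin balancer that skips nodes reported as failed."""

    def __init__(self, nodes: List[str], cooldown_seconds: float = 150.0) -> None:
        self._nodes = list(nodes)
        self._cooldown = cooldown_seconds  # assumption: mirrors the 150 s detection window
        self._failed_until: Dict[str, float] = {}
        self._next = 0

    def record_response(self, headers: Dict[str, str], now: Optional[float] = None) -> None:
        """Mark every node listed in the proposed header as failed for the cooldown."""
        now = time.monotonic() if now is None else now
        # HTTP header names are case-insensitive, so normalize before lookup.
        normalized = {key.lower(): value for key, value in headers.items()}
        value = normalized.get(FAILED_NODE_HEADER, "")
        for node in value.split(","):
            node = node.strip()
            if node:
                self._failed_until[node] = now + self._cooldown

    def pick_node(self, now: Optional[float] = None) -> str:
        """Return the next node in round-robin order that is not marked as failed."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next % len(self._nodes)]
            self._next += 1
            expiry = self._failed_until.get(node)
            if expiry is None or expiry <= now:
                return node
        raise RuntimeError("no healthy data nodes available")


balancer = NodeAwareBalancer(["10.0.0.1:443", "11.183.176.86:443", "10.0.0.3:443"])
balancer.record_response({"failed_datanode": "11.183.176.86:443"})
target = balancer.pick_node()  # retries are routed around the marked node
```

With something like this in the request path, a retry would go to a different node as soon as the first failure is reported, instead of waiting for the 150-second health detection.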

Related component

Search:Resiliency

Describe alternatives you've considered

We asked whether the time OS takes to detect failures could be configured or reduced. OS experts responded that 150 seconds is the standard setting and that changing it is not recommended, because data nodes sometimes recover within that period.

Additional context

I can collate all the AWS support cases that were opened for this issue and send them directly to the engineer working on this feature.

mch2 · 2024-07-24T16:15:36Z

Thanks @viku-int, please reach out to AWS support regarding this issue. Closing this as it doesn't appear to be a generic issue with OS.

viku-int added the enhancement and untriaged labels Jul 15, 2024

github-actions bot added the Search:Resiliency label Jul 15, 2024

github-project-automation bot added this to Search Project Board Jul 15, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Jul 15, 2024

mch2 closed this as completed Jul 24, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Jul 24, 2024

mch2 removed the untriaged label Jul 24, 2024
