[Feature Request] <Avoid sending requests to data nodes if they have already failed serving requests> #14757

Closed
viku-int opened this issue Jul 15, 2024 · 1 comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Resiliency

Comments

@viku-int

Is your feature request related to a problem? Please describe

In the past twelve months, we have had a series of incidents caused by data node failures. The problem is the time load balancers take to detect these failures, which is 150 seconds. Our standard configuration sets the request timeout to at least 5 seconds and retries when a request times out or when we receive a 5xx error from OpenSearch. Even with multiple retries, some requests continue to be routed to nodes that have already failed.
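To give a rough sense of scale, the arithmetic can be sketched as follows. This is only an illustration under a pessimistic assumption: every attempt during the detection window lands on the same failed node and runs to the full timeout. The function name is hypothetical; the 150-second window and 5-second timeout are the figures quoted above.

```python
import math


def worst_case_failed_attempts(detection_window_s: float, request_timeout_s: float) -> int:
    """Upper bound on back-to-back attempts that can time out before the
    failure is detected, assuming each attempt hits the failed node."""
    if request_timeout_s <= 0:
        raise ValueError("request_timeout_s must be positive")
    return math.floor(detection_window_s / request_timeout_s)


# 150-second detection window, 5-second request timeout (figures from this issue)
attempts = worst_case_failed_attempts(150, 5)
```

Under these assumptions, a single caller could exhaust up to 30 consecutive attempts before the node is taken out of rotation. In practice the load balancer spreads requests across nodes, so the real figure would be lower.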

Describe the solution you'd like

We recommend that the OpenSearch service add an HTTP header that identifies failed data nodes by IP address, such as "failed_datanode: 11.183.176.86:443". The ALB should then be configured to stop sending requests to any data node named in this header.
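A minimal sketch of the filtering logic this proposal implies, as a client or proxy might apply it. Everything here is an assumption for illustration: the `failed_datanode` header does not exist in OpenSearch today, the comma-separated format for multiple nodes is a guess, the function names are hypothetical, and the second address in the usage note is made up. It does not show how an ALB would be configured.

```python
from typing import Iterable, List, Mapping, Set

# Hypothetical header name taken from the proposal; not an existing OpenSearch header.
FAILED_NODE_HEADER = "failed_datanode"


def parse_failed_nodes(headers: Mapping[str, str]) -> Set[str]:
    """Return the host:port entries listed in the proposed header.

    Header names are matched case-insensitively. Multiple nodes are assumed
    to be comma-separated (an assumption, not part of the proposal).
    """
    for name, value in headers.items():
        if name.lower() == FAILED_NODE_HEADER:
            return {part.strip() for part in value.split(",") if part.strip()}
    return set()


def healthy_targets(targets: Iterable[str], failed: Set[str]) -> List[str]:
    """Keep only the targets that are not marked as failed, preserving order."""
    return [target for target in targets if target not in failed]
```

For example, given a response carrying `failed_datanode: 11.183.176.86:443` and a target list of `11.183.176.86:443` and `11.183.176.87:443` (the latter a made-up address), `healthy_targets` would return only the second target, so retries would skip the failed node.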

Related component

Search:Resiliency

Describe alternatives you've considered

We asked whether there was a way to configure or reduce the time OpenSearch (OS) takes to detect failures. The OS experts responded that 150 seconds is a standard setting and that changing it is not recommended, because data nodes sometimes recover within this period.

Additional context

I can collate all the AWS support cases that were opened for this issue and send them directly to the engineer working on this feature.

@viku-int viku-int added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 15, 2024
@mch2
Member

mch2 commented Jul 24, 2024

Thanks @viku-int, please reach out to AWS support regarding this issue. Closing this as it doesn't appear to be a generic issue with OS.

@mch2 mch2 closed this as completed Jul 24, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Jul 24, 2024
@mch2 mch2 removed the untriaged label Jul 24, 2024