Is your feature request related to a problem? Please describe
Over the past twelve months, we have had a series of incidents caused by data node failures. The underlying issue is the time load balancers take to detect these failures, which is 150 seconds. In our standard configuration, the request timeout is set to 5 seconds or more, and we retry when a request times out or when we receive a 5xx error from OpenSearch. Even with multiple retries, some requests are still routed to nodes that have already failed.
Describe the solution you'd like
We recommend that the OpenSearch service add an HTTP header that carries the addresses of failed data nodes, for example "failed_datanode: 11.183.176.86:443". Whenever this header is present, the ALB should stop sending requests to the data nodes it lists.
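To make the request concrete, here is a minimal sketch of the filtering step we have in mind, written in Python only for illustration. The header name comes from the example above and is not an existing OpenSearch header. The function names, the comma-separated format for multiple nodes, and the neighbouring target addresses are all assumptions made for this sketch, not a description of how the ALB works today.

```python
from typing import Iterable

# Header name taken from the proposal; not an existing OpenSearch header.
HEADER_NAME = "failed_datanode"


def parse_failed_nodes(headers: dict[str, str]) -> set[str]:
    """Return the "ip:port" entries listed in the proposed header, if any."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(HEADER_NAME, "")
    # Assumption: several failed nodes may be listed, separated by commas.
    return {part.strip() for part in raw.split(",") if part.strip()}


def healthy_targets(targets: Iterable[str], failed: set[str]) -> list[str]:
    """Keep only the targets that were not reported as failed."""
    return [target for target in targets if target not in failed]


# Hypothetical target list; only 11.183.176.86:443 appears in the request.
headers = {"Failed_Datanode": "11.183.176.86:443"}
targets = ["11.183.176.85:443", "11.183.176.86:443", "11.183.176.87:443"]
print(healthy_targets(targets, parse_failed_nodes(headers)))
# prints ['11.183.176.85:443', '11.183.176.87:443']
```

The point of the sketch is that a retry would only choose from the filtered list, so it would not land on the failed node again while the load balancer's own detection is still pending.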
Related component
Search:Resiliency
Describe alternatives you've considered
We asked whether the time OS takes to detect failures can be configured or reduced. OS experts responded that 150 seconds is a standard setting and that changing it is not recommended, because data nodes sometimes recover within this period.
Additional context
I can collate all the AWS support cases that were opened for this issue and send them directly to the engineer working on this feature.