Scaler fails only when failing to get counts from all the interceptor endpoints #903
Provide a description of what has been changed
We observe that the scaler fails and exits its loop when it fails to get counts from any single interceptor replica.

I am not sure this is the intended behavior, but sometimes an interceptor replica is down only because it runs on a spot node. When that node goes away and the endpoints of the interceptor service have not been updated yet, the scaler still tries to fetch from an endpoint that no longer exists, even though the killed interceptor pod will usually recover on its own.
Checklist
README.md
docs/ directory

Fixes #
The change makes the scaler fail only when fetching the counts from all interceptor endpoints has failed.
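
For illustration, here is a minimal Go sketch of the "fail only when every endpoint fails" aggregation this PR aims for. The `fetchCount` type, `aggregateCounts` function, and plain string endpoints are hypothetical stand-ins, not the actual scaler code:

```go
package main

import (
	"errors"
	"fmt"
)

// fetchCount is a hypothetical stand-in for the call that asks a single
// interceptor replica for its pending request count.
type fetchCount func(endpoint string) (int, error)

// aggregateCounts queries every interceptor endpoint and sums the counts.
// Partial failures (e.g. a replica on a terminated spot node) are tolerated;
// an error is returned only when every endpoint failed.
func aggregateCounts(endpoints []string, fetch fetchCount) (int, error) {
	var (
		total   int
		errs    []error
		success bool
	)
	for _, ep := range endpoints {
		count, err := fetch(ep)
		if err != nil {
			errs = append(errs, fmt.Errorf("endpoint %s: %w", ep, err))
			continue
		}
		total += count
		success = true
	}
	if !success && len(endpoints) > 0 {
		// Only fail when no interceptor replica could be reached.
		return 0, errors.Join(errs...)
	}
	return total, nil
}
```

With this shape, a single unreachable replica only reduces the reported count for that cycle instead of aborting the whole scaling loop.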
Comment:
I am new to this codebase and am not sure whether the existing version is the intended behavior. Please let me know if there is a better way to handle this, or if it can already be handled by a config value I am not aware of. Appreciated.