Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

many "Client.Timeout Exceeded while waiting header" on keda operator log #3610

Closed
tshaiman opened this issue Aug 29, 2022 · 10 comments · Fixed by #3640
Closed

many "Client.Timeout Exceeded while waiting header" on keda operator log #3610

tshaiman opened this issue Aug 29, 2022 · 10 comments · Fixed by #3640
Assignees

Comments

@tshaiman
Copy link

tshaiman commented Aug 29, 2022

Report

when Keda pod launches it starts normally and manage to reconcile all queues and all definition correctly.
there are lot of "healthy" info report .

then after few mintues the log is floaded with "Client.Timeout exceeeded while waiting header".

Expected Behavior

  • we want to understand what this error represent
  • we want to see how and where those timeout can be configured
  • we want to understand why a "healthy" reconcile object turned into timeout error

Actual Behavior

  • those errors do not seem to effect the scale up mechanism , as it seems to continue working
  • we dont want to see so many error logs without the ability to understand what is the issue and how to fine-tune the timeouts

Steps to Reproduce the Problem

  1. nothing special we just deployed keda on a relative large cluster with many namespaces ,and each namespace has 10-13 queues
  2. it could be the developer has created TriggerAutheticationObject Per Queue instead per namespace which caused around 130 TriggerAuthentication object to create -> could that be the problem here ?
  3. Cluster is configured with Managed Identity Support (aad-pod-identity) not with Workload Identity
  4. issue happens both on keda 2.7.1 and 2.8.0

Logs from KEDA operator

2022-08-29T19:28:24Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"fc","namespace":"vi-be-map-dev6"}, "namespace": "vi-be-map-dev6", "name": "fc", "reconcileID": "e1df9a0e-a6bd-4558-bc4e-c07c94aed5d0"}
2022-08-29T19:28:24Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"tm","namespace":"vi-be-map-dev12"}, "namespace": "vi-be-map-dev12", "name": "tm", "reconcileID": "184cd058-cee7-4975-840b-3f58b8903305"}
2022-08-29T19:28:24Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"cb","namespace":"vi-be-map-dev6"}, "namespace": "vi-be-map-dev6", "name": "cb", "reconcileID": "ce4c60c5-2772-4be6-a902-785b50af41fa"}
2022-08-29T19:28:25Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"dp","namespace":"vi-be-map-dev8"}, "namespace": "vi-be-map-dev8", "name": "dp", "reconcileID": "8f3d16c4-6589-44f0-90ee-881eba5aa0d1"}
2022-08-29T19:32:09Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"aed","namespace":"vi-be-map-dev8"}, "namespace": "vi-be-map-dev8", "name": "aed", "reconcileID": "24b8af3c-c2c4-49c7-ac8b-77fe6123b73c"}
2022-08-29T19:32:09Z	INFO	Initializing Scaling logic according to ScaledObject Specification	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"aed","namespace":"vi-be-map-dev8"}, "namespace": "vi-be-map-dev8", "name": "aed", "reconcileID": "24b8af3c-c2c4-49c7-ac8b-77fe6123b73c"}
2022-08-29T19:32:10Z	INFO	scaleexecutor	Successfully set ScaleTarget replicas count to ScaledObject minReplicaCount	{"scaledobject.Name": "aed", "scaledObject.Namespace": "vi-be-map-dev8", "scaleTarget.Name": "vi-aed", "Original Replicas Count": 1, "New Replicas Count": 0}
2022-08-29T19:32:21Z	ERROR	azure_servicebus_scaler	error	{"type": "ScaledObject", "namespace": "vi-be-map-dev6", "name": "rc", "error": "Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fservicebus.azure.net%2F\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).IsScaledObjectActive
	/workspace/pkg/scaling/cache/scalers_cache.go:89
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:278
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:149
2022-08-29T19:32:23Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "scaledObject": {"name":"aed","namespace":"vi-be-map-dev8"}, "namespace": "vi-be-map-dev8", "name": "aed", "reconcileID": "25e7b3df-cf95-42ef-b170-05fe46187017"}
2022-08-29T19:32:28Z	ERROR	azure_servicebus_scaler	error	{"type": "ScaledObject", "namespace": "vi-be-map-dev12", "name": "frameextraction", "error": "Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fservicebus.azure.net%2F\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).IsScaledObjectActive
	/workspace/pkg/scaling/cache/scalers_cache.go:89
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:278
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:149
2022-08-29T19:32:29Z	ERROR	azure_servicebus_scaler	error	{"type": "ScaledObject", "namespace": "vi-be-map-dev9", "name": "frameextraction", "error": "Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fservicebus.azure.net%2F\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).IsScaledObjectActive
	/workspace/pkg/scaling/cache/scalers_cache.go:89
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
	/workspace/pkg/scaling/scale_handler.go:278
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
	/workspace/pkg/scaling/scale_handler.go:149

KEDA Version

2.7.1

Kubernetes Version

1.23

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

  • happens both on 2.7.1 and 2.8.0 cluster
  • happens only on relative big clusters
    No response
@tshaiman tshaiman added the bug Something isn't working label Aug 29, 2022
@JorTurFer
Copy link
Member

JorTurFer commented Aug 30, 2022

The url seems to be the aad-pod-identity instance. Could you check the logs you have there?
Did you have the same issue with previous versions?

@tshaiman
Copy link
Author

@JorTurFer : that make sense as there is a limitation of max 20 calls concurrently to the IMDS , and we are doing much more.
will move to workload identity soon ;-)

@JorTurFer
Copy link
Member

so, could this issue be more related with that than KEDA itself? I mean, do you think is KEDA related? We can keep this open till you move from pod identity to workload identity, but just to know if we need to go deeper or not

@tshaiman
Copy link
Author

I think the ask from the Keda team is maybe to catch this exception and either ignore it if the 2nd / 3rd attempt succeeded
(in other words : don't throw if the retry mechanism is working )
and also try to be more verbose : what is the timeout , on which component/scaler , how can we configure longer timeouts etc.

@tomkerkhove
Copy link
Member

We should also check if there is a way to optimize the integration to reduce amount of calls, what do you think?

@JorTurFer
Copy link
Member

JorTurFer commented Aug 30, 2022

I don't think that we should ignore the error because we don't retry it in the same cycle. The reconciliation loop fails and it will be retried, but not only the request, the whole reconciliation loop. I mean, we don't know inside the cycle if another cycle will be executed, we could assume it, but we are not 100%, so we cannot trust on future executions.

WRT timeouts, this uses the default timeout for every HTTP request inside KEDA (3 seconds). This value (and how to change it) is reflected in the docs , you can modify it just setting the environment variable KEDA_HTTP_DEFAULT_TIMEOUT or modifying the helm value if you use helm

Finally, related with the verbosity, the log already prints the scaler type (azure_servicebus_scaler), the SO (frameextraction) where it has happened and the namespace (vi-be-map-dev9).

ERROR	azure_servicebus_scale error	{"type": "ScaledObject", "namespace": "vi-be-map-dev9", "name": "frameextraction", "error": "Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fservicebus.azure.net%2F\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}

What do you think we could append to be clearer? I mean, the timeout is already known, but maybe something like error getting aad token: CURRENT MESSAGE could point better to the current issue? 🤔

@tshaiman
Copy link
Author

yes that was my thoughts as the go stack is almost meaningless.
thanks for pointing out the ENV Variable to increase timeout ,I will Definity try it .
and yes, I do see the log is quite verbose , thank you

@JorTurFer
Copy link
Member

We should also check if there is a way to optimize the integration to reduce amount of calls, what do you think?

@zroubalik and I have been talking recently about how to apply this. It'll improve amount of calls for sure, but maybe we can try to improve it more, IDK

@JorTurFer
Copy link
Member

do you think that adding something like error getting aad-pod-identity token: in the error message would help to detect this in the future? It's a simple thing to do if it helps

@tomkerkhove
Copy link
Member

I think that would be helpful, yes

@tomkerkhove tomkerkhove removed the bug Something isn't working label Aug 30, 2022
@JorTurFer JorTurFer self-assigned this Sep 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants