Add prometheus wpa controller reconcile and wpa valid metrics #142
What does this PR do?
Adds 2 metrics:
wpa_controller_reconcile_error: Reports 1 with the tag reason:<short_error_message> if the last reconcile results in an error. If there is no error, no metric is reported.
wpa_controller_reconcile_success: Reports 1 if the last reconcile is successful, 0 if there was an error.
Motivation
More ways to track WPA and WPA controller errors. There's a controller_runtime_reconcile_total metric, but that only has the labels controller and result, which don't give much detail about the reconcile error.
Describe your test plan
Set up one or more WPAs (example). The WPA should be valid, the target resource should be present, and the Datadog metric should be present and reporting consistently. The wpa_controller_reconcile_success metric should be present with value 1 and the following labels:
resource_kind
resource_name
resource_namespace
wpa_name
wpa_namespace
To visualize metrics, either collect them via the node agent with the prometheus or openmetrics check (example), or check the /metrics endpoint:
kubectl exec -it <wpa_controller> -- curl localhost:8383/metrics
Update the WPA (and/or target resource) to force an error (examples below) and ensure that the wpa_controller_reconcile_success metric reports 0 and that the wpa_controller_reconcile_error metric reports 1 with the appropriate reason tag. There shouldn't be any stale metrics; the metrics should update accordingly when going from an error to an ok state, from one error state to another, and when the WPA is deleted.
This isn't an exhaustive list of all possible errors (and reason values), but here are a few ways to force some errors:
Use an invalid metric (e.g. system.load.1.invalid). This should give metrics with reason:failed_compute_replicas and "Failed to compute desired number of replicas based on listed metrics." logs in the controller pod.
Set an invalid spec.scaleTargetRef.apiVersion to get reason:invalid_api_version.
Use an apiVersion that the cluster does not serve in spec.scaleTargetRef.apiVersion. For example, extensions/v1beta1 for Deployments stopped being served in Kubernetes v1.16, so on a newer cluster with the target Deployment using the apps/v1 apiVersion, using the old apiVersion results in reason:unknown_resource and the log line "unable to determine resource for scale target reference".
Set spec.minReplicas larger than spec.maxReplicas to show the log message "Invalid WPA specification: watermark pod autoscaler requires the minimum number of replicas to be configured and inferior to the maximum" and the tag reason:invalid_wpa_spec.
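The expected metric transitions in the test plan (ok to error, error to a different error, error to ok, and WPA deletion) can be modeled with a small in-memory sketch. This is not the PR's implementation: gaugeStore, series, recordReconcile, and forget are hypothetical names standing in for whatever gauge vectors the controller actually registers, and only the wpa name is used as a label to keep the example short.

```go
package main

import "fmt"

const (
	successMetric = "wpa_controller_reconcile_success"
	errorMetric   = "wpa_controller_reconcile_error"
)

// series identifies one time series: a metric name plus its labels.
// Hypothetical and simplified: the real metrics carry more labels.
type series struct {
	metric string
	wpa    string
	reason string
}

// gaugeStore is a stand-in for a set of Prometheus gauges.
type gaugeStore map[series]float64

// recordReconcile stores the outcome of the last reconcile for one WPA.
// An empty reason means success. Any earlier error series for the WPA is
// deleted first, so a previous reason never lingers as a stale metric.
func (g gaugeStore) recordReconcile(wpa, reason string) {
	for s := range g {
		if s.metric == errorMetric && s.wpa == wpa {
			delete(g, s)
		}
	}
	success := series{metric: successMetric, wpa: wpa}
	if reason == "" {
		g[success] = 1
		return
	}
	g[success] = 0
	g[series{metric: errorMetric, wpa: wpa, reason: reason}] = 1
}

// forget drops every series for a WPA, as expected when it is deleted.
func (g gaugeStore) forget(wpa string) {
	for s := range g {
		if s.wpa == wpa {
			delete(g, s)
		}
	}
}

func main() {
	g := gaugeStore{}
	g.recordReconcile("example-wpa", "invalid_wpa_spec")        // ok -> error
	g.recordReconcile("example-wpa", "failed_compute_replicas") // error -> other error
	g.recordReconcile("example-wpa", "")                        // error -> ok
	fmt.Println("series after recovery:", len(g))
	g.forget("example-wpa") // WPA deleted
	fmt.Println("series after deletion:", len(g))
}
```

Clearing the old error series before writing the new one is what keeps a previous reason value from being reported alongside the current one.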