The Kubernetes controller-runtime library provides a Prometheus metrics
endpoint by default. The Upjet-based providers, including
upbound/provider-aws, upbound/provider-azure, upbound/provider-azuread, and
upbound/provider-gcp, expose various metrics from controller-runtime to help
monitor the health of its runtime components, such as the controller-runtime
client, the leader-election client, and the controller workqueues. In
addition, each controller exposes metrics related to the reconciliation of
custom resources and the number of active reconciliation worker goroutines.
Beyond these controller-runtime metrics, the Upjet-based providers also
expose metrics specific to the Upjet runtime. The Upjet runtime registers
custom metrics via the available extension mechanism, and they are served
from the default `/metrics` endpoint of the provider pod. The custom metrics
exposed by the Upjet runtime are:
- `upjet_terraform_cli_duration`: A histogram reporting statistics, in seconds, on how long Terraform CLI invocations take to complete.
- `upjet_terraform_active_cli_invocations`: A gauge reporting the number of active (running) Terraform CLI invocations.
- `upjet_terraform_running_processes`: A gauge reporting the number of running Terraform CLI and Terraform provider processes.
- `upjet_resource_ttr`: A histogram measuring, in seconds, the time-to-readiness (TTR) of managed resources.
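On a scrape of the provider pod's `/metrics` endpoint, these metrics appear in the standard Prometheus exposition format. The following is a minimal sketch; the sample values are made up, but the metric names and label sets match the custom Upjet metrics described here:

```shell
# Illustrative scrape output; values are invented, names and labels are not.
cat <<'EOF' > /tmp/upjet-metrics.txt
# TYPE upjet_terraform_active_cli_invocations gauge
upjet_terraform_active_cli_invocations{mode="async",subcommand="apply"} 3
upjet_terraform_active_cli_invocations{mode="sync",subcommand="plan"} 1
# TYPE upjet_terraform_running_processes gauge
upjet_terraform_running_processes{type="cli"} 4
upjet_terraform_running_processes{type="provider"} 4
# TYPE upjet_terraform_cli_duration histogram
upjet_terraform_cli_duration_sum{mode="async",subcommand="apply"} 36.4
upjet_terraform_cli_duration_count{mode="async",subcommand="apply"} 7
EOF

# List only the Upjet-specific series (comment lines start with '#')
grep '^upjet_' /tmp/upjet-metrics.txt
```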
Prometheus metrics can carry labels that differentiate the characteristics of the measurements being made, for example distinguishing CLI processes from Terraform provider processes when counting running Terraform processes. The labels associated with each of the custom Upjet metrics are:
- Labels associated with the `upjet_terraform_cli_duration` metric:
  - `subcommand`: The `terraform` subcommand that was run, e.g., `init`, `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI was invoked synchronously as part of a reconcile loop) or `async` (the CLI was invoked asynchronously; the reconciler goroutine polls and collects the results later).
- Labels associated with the `upjet_terraform_active_cli_invocations` metric:
  - `subcommand`: The `terraform` subcommand that was run, e.g., `init`, `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI was invoked synchronously as part of a reconcile loop) or `async` (the CLI was invoked asynchronously; the reconciler goroutine polls and collects the results later).
- Labels associated with the `upjet_terraform_running_processes` metric:
  - `type`: Either `cli` for Terraform CLI (`terraform`) processes or `provider` for Terraform provider processes. Please note that this is a best-effort metric that may not precisely catch and report all relevant processes. It could be improved in the future if needed, for example by watching `fork` system calls, but even now it is useful for spotting rogue Terraform provider processes.
- Labels associated with the `upjet_resource_ttr` metric:
  - `group`, `version`, `kind`: Record the API group, version, and kind of the managed resource whose time-to-readiness is being measured.
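Because these are ordinary Prometheus labels, they can be aggregated with standard tooling even outside of Prometheus. As a quick illustration (the scrape fragment and its values are made up), summing `upjet_terraform_running_processes` across both `type` label values gives the total process count:

```shell
# Made-up scrape fragment for upjet_terraform_running_processes
cat <<'EOF' > /tmp/procs.txt
upjet_terraform_running_processes{type="cli"} 4
upjet_terraform_running_processes{type="provider"} 5
EOF

# Sum the gauge over both `type` label values => total running processes
awk '{ total += $NF } END { print total }' /tmp/procs.txt
# prints 9
```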
You can export all of these custom metrics, together with the
controller-runtime metrics, from the provider pod to Prometheus. Here are
some examples showing the custom metrics in action in the Prometheus console:
- `upjet_terraform_active_cli_invocations` gauge metric showing the sync & async `terraform init/apply/plan/destroy` invocations:
- `upjet_terraform_running_processes` gauge metric showing both the `cli` and `provider` labels:
- `upjet_terraform_cli_duration` histogram metric, showing the average Terraform CLI running times over the last 5m:
- The medians (0.5-quantiles) of these observations, aggregated by the mode and the Terraform subcommand being invoked:
- `upjet_resource_ttr` histogram metric, showing the average resource TTR over the last 10m:
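The console views above can be reproduced with PromQL queries along the following lines. This is a sketch: the expressions use only the metric and label names documented above, and follow the standard PromQL conventions for gauges and histograms, but the exact label sets may vary across provider versions:

```promql
# Sync & async CLI invocations currently in flight, per subcommand
sum by (mode, subcommand) (upjet_terraform_active_cli_invocations)

# Average Terraform CLI running time over the last 5m
rate(upjet_terraform_cli_duration_sum[5m])
  / rate(upjet_terraform_cli_duration_count[5m])

# Median (0.5-quantile) CLI duration, aggregated by mode and subcommand
histogram_quantile(0.5,
  sum by (mode, subcommand, le) (rate(upjet_terraform_cli_duration_bucket[5m])))

# Average resource time-to-readiness over the last 10m
rate(upjet_resource_ttr_sum[10m]) / rate(upjet_resource_ttr_count[10m])
```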
These samples were collected by provisioning 10 upbound/provider-aws
`cognitoidp.UserPool` resources and running the provider with a poll interval
of 1m. One can observe that each resource was polled (reconciled) twice after
it acquired the `Ready=True` condition, and that the resources were then
destroyed.
You can find a full reference of the metrics exposed by the Upjet-based providers here.