diff --git a/docs/monitoring.md b/docs/monitoring.md
new file mode 100644
index 00000000..991a4df5
--- /dev/null
+++ b/docs/monitoring.md
@@ -0,0 +1,69 @@
## Monitoring the Upjet Runtime

The [Kubernetes controller-runtime] library provides a Prometheus metrics endpoint by default. The Upjet-based providers,
including [upbound/provider-aws], [upbound/provider-azure], [upbound/provider-azuread] and [upbound/provider-gcp],
expose [various metrics](https://book-v1.book.kubebuilder.io/beyond_basics/controller_metrics.html) from the controller-runtime
to help monitor the health of its runtime components, such as the [`controller-runtime` client],
the [leader election client] and the [controller workqueues]. In addition to these, each controller
[exposes](https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/internal/controller/metrics/metrics.go#L25)
metrics on the reconciliation of custom resources and on the active reconciliation worker goroutines.

Beyond the metrics exposed by the `controller-runtime`, the Upjet-based providers also expose metrics specific to
the Upjet runtime. The Upjet runtime registers these custom metrics using the [available extension mechanism](https://book.kubebuilder.io/reference/metrics.html#publishing-additional-metrics),
and they are served from the default `/metrics` endpoint of the provider pod. The custom metrics exposed by the
Upjet runtime are:

- `upjet_terraform_cli_duration`: A histogram reporting, in seconds, how long Terraform CLI invocations take to complete.
- `upjet_terraform_active_cli_invocations`: A gauge reporting the number of active (running) Terraform CLI invocations.
- `upjet_terraform_running_processes`: A gauge reporting the number of running Terraform CLI and Terraform provider processes.
- `upjet_resource_ttr`: A histogram measuring, in seconds, the time-to-readiness (TTR) for managed resources.

Prometheus metrics can have [labels] associated with them to differentiate the characteristics of the measurements being
made, such as distinguishing the CLI processes from the Terraform provider processes when counting the number of
active Terraform processes. The labels associated with each of the above custom Upjet metrics are:

- Labels associated with the `upjet_terraform_cli_duration` metric:
  - `subcommand`: The `terraform` subcommand that was run, e.g., `init`, `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI was invoked synchronously as part of a
    reconcile loop) or `async` (the CLI was invoked asynchronously; the reconciler goroutine will poll and collect
    the results later).
- Labels associated with the `upjet_terraform_active_cli_invocations` metric:
  - `subcommand` and `mode`, with the same meanings as for `upjet_terraform_cli_duration` above.
- Labels associated with the `upjet_terraform_running_processes` metric:
  - `type`: Either `cli` for the Terraform CLI (the `terraform` process) or `provider` for the Terraform provider processes.
    Note that this is a best-effort metric that may not be able to precisely catch and report all relevant processes.
    We may improve it in the future if needed, for example by watching `fork` system calls, but even now it can be
    useful for spotting rogue Terraform provider processes.
- Labels associated with the `upjet_resource_ttr` metric:
  - `group`, `version` and `kind`: Record the [API group, version and kind](https://kubernetes.io/docs/reference/using-api/api-concepts/)
    of the managed resource whose [time-to-readiness](https://github.com/crossplane/terrajet/issues/55#issuecomment-929494212)
    measurement is captured.

You can [export](https://book.kubebuilder.io/reference/metrics.html) all these custom metrics, together with
the `controller-runtime` metrics, from the provider pod for Prometheus. Here are some examples showing the custom
metrics in action in the Prometheus console (a hedged sketch of PromQL expressions for similar views is given below):

- The `upjet_terraform_active_cli_invocations` gauge metric showing the sync & async `terraform init/apply/plan/destroy` invocations:
  *(screenshot)*

- The `upjet_terraform_running_processes` gauge metric showing both the `cli` and the `provider` labels:
  *(screenshot)*

- The `upjet_terraform_cli_duration` histogram metric showing the average Terraform CLI running times for the last 5m:
  *(screenshot)*

- The medians (0.5-quantiles) of these observations aggregated by the mode and the Terraform subcommand being invoked:
  *(screenshot)*

- The `upjet_resource_ttr` histogram metric showing the average resource TTR for the last 10m:
  *(screenshot)*

- The median (0.5-quantile) of these TTR observations:
  *(screenshot)*

These samples were collected by provisioning 10 [upbound/provider-aws] `cognitoidp.UserPool` resources while running the
provider with a poll interval of 1m. One can observe that the resources were polled (reconciled) twice after they
acquired the `Ready=True` condition, and were then destroyed.

You can find a full reference of the metrics exposed by the Upjet-based providers [here](provider_metrics_help.txt).
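As a rough guide to reproducing views like the screenshots above, here is a minimal PromQL sketch. The exact queries
behind the screenshots are not recorded in this document; these expressions only assume the standard Prometheus
histogram conventions, under which a histogram metric `m` is exposed as the `m_bucket`, `m_sum` and `m_count` series.
Each expression is meant to be evaluated on its own in the Prometheus expression browser:

```promql
# Average Terraform CLI running time over the last 5m, per execution mode and subcommand:
  sum by (mode, subcommand) (rate(upjet_terraform_cli_duration_sum[5m]))
/ sum by (mode, subcommand) (rate(upjet_terraform_cli_duration_count[5m]))

# Median (0.5-quantile) of the CLI durations, aggregated by mode and subcommand:
histogram_quantile(0.5, sum by (mode, subcommand, le) (rate(upjet_terraform_cli_duration_bucket[5m])))

# Average resource time-to-readiness over the last 10m:
rate(upjet_resource_ttr_sum[10m]) / rate(upjet_resource_ttr_count[10m])

# Median (0.5-quantile) of the TTR observations:
histogram_quantile(0.5, sum by (le) (rate(upjet_resource_ttr_bucket[10m])))
```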
[Kubernetes controller-runtime]: https://github.com/kubernetes-sigs/controller-runtime
[upbound/provider-aws]: https://github.com/upbound/provider-aws
[upbound/provider-azure]: https://github.com/upbound/provider-azure
[upbound/provider-azuread]: https://github.com/upbound/provider-azuread
[upbound/provider-gcp]: https://github.com/upbound/provider-gcp
[`controller-runtime` client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/client_go_adapter.go#L40
[leader election client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/leaderelection.go#L12
[controller workqueues]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/workqueue.go#L40
[labels]: https://prometheus.io/docs/practices/naming/#labels

diff --git a/docs/provider_metrics_help.txt b/docs/provider_metrics_help.txt
new file mode 100644
index 00000000..638a829c
--- /dev/null
+++ b/docs/provider_metrics_help.txt
@@ -0,0 +1,147 @@
# HELP upjet_terraform_cli_duration Measures in seconds how long it takes a Terraform CLI invocation to complete
# TYPE upjet_terraform_cli_duration histogram

# HELP upjet_terraform_running_processes The number of running Terraform CLI and Terraform provider processes
# TYPE upjet_terraform_running_processes gauge

# HELP upjet_resource_ttr Measures in seconds the time-to-readiness (TTR) for managed resources
# TYPE upjet_resource_ttr histogram

# HELP upjet_terraform_active_cli_invocations The number of active (running) Terraform CLI invocations
# TYPE upjet_terraform_active_cli_invocations gauge

# HELP certwatcher_read_certificate_errors_total Total number of certificate read errors
# TYPE certwatcher_read_certificate_errors_total counter

# HELP certwatcher_read_certificate_total Total number of certificate reads
# TYPE certwatcher_read_certificate_total counter

# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge

# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# TYPE controller_runtime_max_concurrent_reconciles gauge

# HELP controller_runtime_reconcile_errors_total Total number of reconciliation errors per controller
# TYPE controller_runtime_reconcile_errors_total counter

# HELP controller_runtime_reconcile_time_seconds Length of time per reconciliation per controller
# TYPE controller_runtime_reconcile_time_seconds histogram

# HELP controller_runtime_reconcile_total Total number of reconciliations per controller
# TYPE controller_runtime_reconcile_total counter

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge

# HELP go_info Information about the Go environment.
# TYPE go_info gauge

# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge

# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter

# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter

# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge

# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge

# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge

# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge

# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge

# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge

# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge

# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge

# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter

# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter

# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge

# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge

# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge

# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge

# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge

# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge

# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge

# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge

# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge

# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge

# HELP rest_client_request_duration_seconds Request latency in seconds. Broken down by verb, and host.
# TYPE rest_client_request_duration_seconds histogram

# HELP rest_client_request_size_bytes Request size in bytes. Broken down by verb and host.
# TYPE rest_client_request_size_bytes histogram

# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter

# HELP rest_client_response_size_bytes Response size in bytes. Broken down by verb and host.
# TYPE rest_client_response_size_bytes histogram

# HELP workqueue_adds_total Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter

# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge

# HELP workqueue_longest_running_processor_seconds How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge

# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
# TYPE workqueue_queue_duration_seconds histogram

# HELP workqueue_retries_total Total number of retries handled by workqueue
# TYPE workqueue_retries_total counter

# HELP workqueue_unfinished_work_seconds How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
# TYPE workqueue_unfinished_work_seconds gauge

# HELP workqueue_work_duration_seconds How long in seconds processing an item from workqueue takes.
# TYPE workqueue_work_duration_seconds histogram