From d017bbe074a4801ca01086f9b8ad0337fffdb1e8 Mon Sep 17 00:00:00 2001
From: Alper Rifat Ulucinar
Date: Fri, 31 Mar 2023 17:46:43 +0300
Subject: [PATCH] Add an introductory Upjet-based provider monitoring guide

Signed-off-by: Alper Rifat Ulucinar
---
 docs/monitoring.md             | 116 ++++++++++++++++++++++++++
 docs/provider_metrics_help.txt | 147 +++++++++++++++++++++++++++++++++
 2 files changed, 263 insertions(+)
 create mode 100644 docs/monitoring.md
 create mode 100644 docs/provider_metrics_help.txt

diff --git a/docs/monitoring.md b/docs/monitoring.md
new file mode 100644
index 00000000..314d915f
--- /dev/null
+++ b/docs/monitoring.md
@@ -0,0 +1,116 @@
## Monitoring the Upjet Runtime

The [Kubernetes controller-runtime] library provides a Prometheus metrics
endpoint by default. The Upjet-based providers, including
[upbound/provider-aws], [upbound/provider-azure], [upbound/provider-azuread]
and [upbound/provider-gcp], expose
[various metrics](https://book.kubebuilder.io/reference/metrics-reference.html)
from the controller-runtime to help monitor the health of the runtime
components, such as the [`controller-runtime` client], the [leader election
client], the [controller workqueues], etc. In addition to these, each
controller also
[exposes](https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/internal/controller/metrics/metrics.go#L25)
metrics related to the reconciliation of custom resources and to the number of
active reconciliation worker goroutines.

In addition to the metrics exposed by the controller-runtime, the Upjet-based
providers also expose metrics specific to the Upjet runtime. The Upjet runtime
registers these custom metrics using the
[available extension mechanism](https://book.kubebuilder.io/reference/metrics.html#publishing-additional-metrics)
(a minimal registration sketch is given at the end of this guide), and they
are served from the default `/metrics` endpoint of the provider pod. The
custom metrics exposed by the Upjet runtime are:
- `upjet_terraform_cli_duration`: A histogram that reports, in seconds, how
  long Terraform CLI invocations take to complete.
- `upjet_terraform_active_cli_invocations`: A gauge that reports the number of
  active (running) Terraform CLI invocations.
- `upjet_terraform_running_processes`: A gauge that reports the number of
  running Terraform CLI and Terraform provider processes.
- `upjet_resource_ttr`: A histogram that measures, in seconds, the
  time-to-readiness (TTR) for managed resources.

Prometheus metrics can have [labels] associated with them to differentiate the
characteristics of the measurements being made, for example to distinguish the
Terraform CLI processes from the Terraform provider processes when counting
the number of running Terraform processes. The labels associated with each of
the custom Upjet metrics are listed below:
- Labels associated with the `upjet_terraform_cli_duration` metric:
  - `subcommand`: The `terraform` subcommand that was run, e.g., `init`,
    `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI
    was invoked synchronously as part of a reconcile loop) or `async` (the CLI
    was invoked asynchronously and the reconciler goroutine will poll and
    collect its results later).
- Labels associated with the `upjet_terraform_active_cli_invocations` metric:
  - `subcommand`: The `terraform` subcommand that was run, e.g., `init`,
    `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI
    was invoked synchronously as part of a reconcile loop) or `async` (the CLI
    was invoked asynchronously and the reconciler goroutine will poll and
    collect its results later).
- Labels associated with the `upjet_terraform_running_processes` metric:
  - `type`: Either `cli` for Terraform CLI (the `terraform` binary) processes
    or `provider` for Terraform provider processes. Please note that this is a
    best-effort metric that may not be able to precisely catch and report all
    relevant processes. We may improve this in the future if needed, for
    example by watching the `fork` system calls, but even now it can be useful
    for spotting rogue Terraform provider processes.
- Labels associated with the `upjet_resource_ttr` metric:
  - `group`, `version`, `kind`: These labels record the [API group, version
    and kind](https://kubernetes.io/docs/reference/using-api/api-concepts/) of
    the managed resource whose
    [time-to-readiness](https://github.com/crossplane/terrajet/issues/55#issuecomment-929494212)
    measurement is captured.

## Examples

You can [export](https://book.kubebuilder.io/reference/metrics.html) all of
these custom metrics, together with the controller-runtime metrics, from the
provider pod for Prometheus to scrape. Here are some examples showing the
custom metrics in action in the Prometheus console:

- The `upjet_terraform_active_cli_invocations` gauge metric showing the sync
  and async `terraform init/apply/plan/destroy` invocations.
- The `upjet_terraform_running_processes` gauge metric showing both the `cli`
  and `provider` labels.
- The `upjet_terraform_cli_duration` histogram metric showing the average
  Terraform CLI running times over the last 5m.
- The medians (0.5-quantiles) of these observations aggregated by the mode and
  the Terraform subcommand being invoked.
- The `upjet_resource_ttr` histogram metric showing the average resource TTR
  over the last 10m.
- The median (0.5-quantile) of these TTR observations.

These samples were collected by provisioning 10 [upbound/provider-aws]
`cognitoidp.UserPool` resources while running the provider with a poll
interval of 1m. In these examples, one can observe that the resources were
polled (reconciled) twice after they acquired the `Ready=True` condition, and
after that they were destroyed.

## Reference

You can find a full reference of the metrics exposed by the Upjet-based
providers [here](provider_metrics_help.txt).
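
## Registering Custom Metrics

The following is a minimal, illustrative sketch of the extension mechanism
mentioned above: registering a custom Prometheus metric with the
controller-runtime's global metrics registry so that it is served from the
same `/metrics` endpoint as the built-in metrics. It is not the actual Upjet
source; the bucket layout and the `recordTTR` helper are assumptions made for
the example, while the metric name, help text and label names follow the ones
documented above.

```go
package providermetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// ttr is an illustrative stand-in for the time-to-readiness histogram.
// The bucket layout below is an assumption, not necessarily what Upjet uses.
var ttr = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "upjet_resource_ttr",
	Help:    "Measures in seconds the time-to-readiness (TTR) for managed resources",
	Buckets: prometheus.ExponentialBuckets(10, 2, 8),
}, []string{"group", "version", "kind"})

func init() {
	// Registering with the controller-runtime's global registry makes the
	// metric available from the manager's default /metrics endpoint.
	metrics.Registry.MustRegister(ttr)
}

// recordTTR is a hypothetical helper that would be called once a managed
// resource transitions to Ready=True.
func recordTTR(group, version, kind string, created time.Time) {
	ttr.WithLabelValues(group, version, kind).Observe(time.Since(created).Seconds())
}
```

Once such a histogram is scraped, the median views described in the examples
above can be derived with the standard Prometheus histogram conventions, e.g.
`histogram_quantile(0.5, sum by (group, version, kind, le)
(rate(upjet_resource_ttr_bucket[10m])))`, where the `_bucket` series and the
`histogram_quantile` function are provided by Prometheus for every histogram
metric.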
[Kubernetes controller-runtime]: https://github.com/kubernetes-sigs/controller-runtime
[upbound/provider-aws]: https://github.com/upbound/provider-aws
[upbound/provider-azure]: https://github.com/upbound/provider-azure
[upbound/provider-azuread]: https://github.com/upbound/provider-azuread
[upbound/provider-gcp]: https://github.com/upbound/provider-gcp
[`controller-runtime` client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/client_go_adapter.go#L40
[leader election client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/leaderelection.go#L12
[controller workqueues]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/workqueue.go#L40
[labels]: https://prometheus.io/docs/practices/naming/#labels

diff --git a/docs/provider_metrics_help.txt b/docs/provider_metrics_help.txt
new file mode 100644
index 00000000..638a829c
--- /dev/null
+++ b/docs/provider_metrics_help.txt
@@ -0,0 +1,147 @@
# HELP upjet_terraform_cli_duration Measures in seconds how long it takes a Terraform CLI invocation to complete
# TYPE upjet_terraform_cli_duration histogram

# HELP upjet_terraform_running_processes The number of running Terraform CLI and Terraform provider processes
# TYPE upjet_terraform_running_processes gauge

# HELP upjet_resource_ttr Measures in seconds the time-to-readiness (TTR) for managed resources
# TYPE upjet_resource_ttr histogram

# HELP upjet_terraform_active_cli_invocations The number of active (running) Terraform CLI invocations
# TYPE upjet_terraform_active_cli_invocations gauge

# HELP certwatcher_read_certificate_errors_total Total number of certificate read errors
# TYPE certwatcher_read_certificate_errors_total counter

# HELP certwatcher_read_certificate_total Total number of certificate reads
# TYPE certwatcher_read_certificate_total counter

# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge

# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# TYPE controller_runtime_max_concurrent_reconciles gauge

# HELP controller_runtime_reconcile_errors_total Total number of reconciliation errors per controller
# TYPE controller_runtime_reconcile_errors_total counter

# HELP controller_runtime_reconcile_time_seconds Length of time per reconciliation per controller
# TYPE controller_runtime_reconcile_time_seconds histogram

# HELP controller_runtime_reconcile_total Total number of reconciliations per controller
# TYPE controller_runtime_reconcile_total counter

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge

# HELP go_info Information about the Go environment.
# TYPE go_info gauge

# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge

# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter

# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter

# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge

# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge

# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge

# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge

# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge

# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge

# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge

# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge

# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter

# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter

# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge

# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge

# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge

# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge

# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge

# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge

# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge

# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge

# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge

# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge

# HELP rest_client_request_duration_seconds Request latency in seconds. Broken down by verb, and host.
# TYPE rest_client_request_duration_seconds histogram

# HELP rest_client_request_size_bytes Request size in bytes. Broken down by verb and host.
# TYPE rest_client_request_size_bytes histogram

# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter

# HELP rest_client_response_size_bytes Response size in bytes. Broken down by verb and host.
# TYPE rest_client_response_size_bytes histogram

# HELP workqueue_adds_total Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter

# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge

# HELP workqueue_longest_running_processor_seconds How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge

# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
# TYPE workqueue_queue_duration_seconds histogram

# HELP workqueue_retries_total Total number of retries handled by workqueue
# TYPE workqueue_retries_total counter

# HELP workqueue_unfinished_work_seconds How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
# TYPE workqueue_unfinished_work_seconds gauge

# HELP workqueue_work_duration_seconds How long in seconds processing an item from workqueue takes.
# TYPE workqueue_work_duration_seconds histogram