chore(RHTAPWATCH-734): Alert routing namespace annotation
Adds an "alert_route_namespace" annotation to all available
alerts and their tests to allow for alert tagging in Slack.

Alerts that belong to a specific team have a hard-coded
value for this annotation, while non-specific alerts derive
it from the namespace label.

Signed-off-by: Omer <[email protected]>
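
The two cases described in the commit message can be sketched roughly as follows. This is a minimal illustration, not a file from this commit: the resource name, alert names, metric names, thresholds, and durations are hypothetical, and only the `alert_route_namespace` annotation and its two value styles are taken from the diff.

```yaml
# Hypothetical PrometheusRule fragment. Alert names, metrics, and thresholds
# are illustrative assumptions; only the alert_route_namespace annotation
# mirrors what the diff adds.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-routing-annotations
spec:
  groups:
    - name: example_alerts
      rules:
        # Team-owned alert: the routing namespace is hard-coded.
        - alert: ExampleTeamOwnedAlert
          expr: example_team_errors_total > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Example team-owned alert
            alert_route_namespace: build-service
        # Non-specific alert: the routing namespace is templated from the
        # namespace label of the firing series.
        - alert: ExampleNamespacedAlert
          expr: example_namespace_errors_total > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Example alert in {{ $labels.namespace }}
            alert_route_namespace: '{{ $labels.namespace }}'
```

In the second rule the template is quoted because a YAML value that begins with `{` would otherwise be parsed as a flow mapping.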
Omeramsc committed Jan 25, 2024
1 parent c1b5f5e commit 445d110
Showing 54 changed files with 144 additions and 12 deletions.
2 changes: 2 additions & 0 deletions rhobs/alerting/data_plane/prometheus.application_alerts.yaml
@@ -25,6 +25,7 @@ spec:
Application controller in Pod {{ $labels.pod }} for namespace
{{ $labels.namespace }} on cluster {{ $labels.source_cluster }} is failing to
successfully delete at least 95% of applications over the past hour
alert_route_namespace: 'application-service'
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/application-delete-failed.md
- alert: ApplicationCreationErrors
expr: |
@@ -42,4 +43,5 @@ spec:
Application controller in Pod {{ $labels.pod }} for namespace
{{ $labels.namespace }} on cluster {{ $labels.source_cluster }} is failing to
successfully create at least 95% of applications over the past hour
alert_route_namespace: 'application-service'
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/application-create-failed.md
2 changes: 2 additions & 0 deletions rhobs/alerting/data_plane/prometheus.component_alerts.yaml
@@ -25,6 +25,7 @@ spec:
Component controller in Pod {{ $labels.pod }} for namespace
{{ $labels.namespace }} on cluster {{ $labels.source_cluster }} is failing to
successfully delete at least 95% of components over the past hour
alert_route_namespace: '{{ $labels.namespace }}'
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-delete-failed.md
- alert: ComponentCreationErrors
expr: |
@@ -42,4 +43,5 @@ spec:
Component controller in Pod {{ $labels.pod }} for namespace
{{ $labels.namespace }} on cluster {{ $labels.source_cluster }} is failing to
successfully create at least 95% of components over the past hour
alert_route_namespace: '{{ $labels.namespace }}'
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-create-failed.md
@@ -21,4 +21,5 @@ spec:
Cluster-Agent Operations in non-complete state are too high. Got: {{ $value }}
description: >-
The sum of cluster-agent operations in non-complete state is greater than 10 on cluster {{ $labels.source_cluster }}
alert_route_namespace: "gitops-service-argocd"
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/gitops/cluster-agent-operations.md
@@ -29,4 +29,5 @@ spec:
Less than 95% of GitOps Deployments are in Healthy state: {{ $value | humanizePercentage }} in cluster: {{ $labels.source_cluster }}
description: >-
The percentage total of Argo CD Deployments that are in Healthy state is less than 95% in cluster: {{ $labels.source_cluster }}.
alert_route_namespace: "gitops-service-argocd"
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/gitops/deployments.md
@@ -29,4 +29,5 @@ spec:
Less than 95% of GitOps Routes are in Healthy state: {{ $value | humanizePercentage }} in cluster: {{ $labels.source_cluster }}
description: >-
The percentage total of Argo CD Routes that are in Healthy state is less than 95% in cluster: {{ $labels.source_cluster }}.
alert_route_namespace: "gitops-service-argocd"
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/gitops/routes.md
@@ -29,4 +29,5 @@ spec:
Less than 95% of GitOps StatefulSets are in Healthy state: {{ $value | humanizePercentage }} in cluster: {{ $labels.source_cluster }}
description: >-
The percentage total of Argo CD StatefulSets that are in Healthy state is less than 95% in cluster: {{ $labels.source_cluster }}.
alert_route_namespace: "gitops-service-argocd"
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/gitops/statefulsets.md
@@ -31,4 +31,5 @@ spec:
Less than 95% of GitOps applications are in synced state: {{ $value | humanizePercentage }}
description: >-
The percentage total of all Argo CD applications that are in Synced state is less than 95%.
alert_route_namespace: "gitops-service-argocd"
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/gitops/deploy-from-git-to-k8s.md
@@ -31,4 +31,5 @@ spec:
if PaC provision is requested upon the Component creation, then till the provision finishes has been over
60s for more than 10% of requests during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: build-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/build-service/latency_component_onboarding.md
@@ -30,4 +30,5 @@ spec:
Time taken to provision image repository has been over
30s for more than 5% of requests during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: image-controller
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/image-controller/latency_image_repository_provision.md
@@ -30,4 +30,5 @@ spec:
Time taken from PaC provision request till Component is provisioned for PaC builds has been over
20s for more than 5% of requests during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: build-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/build-service/latency_pac_provision.md
@@ -30,4 +30,5 @@ spec:
Time taken from PaC unprovision request till Component is unprovisioned for PaC builds has been over
20s for more than 5% of requests during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: build-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/build-service/latency_pac_unprovision.md
@@ -30,4 +30,5 @@ spec:
Time from Snapshot marked as passed to release created has been over
10s for more than 10% of requests during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: integration-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/integration-service/latency_release_creation.md
@@ -30,4 +30,5 @@ spec:
Time taken from simple build request till the build pipeline is submitted has been over
15s for more than 5% of requests during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: build-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/build-service/latency_simple_build.md
@@ -30,4 +30,5 @@ spec:
Time from Snapshot created to integration PLRs in static envs created has been over
5s for {{ $value | humanizePercentage }} of requests (tolerance 10%) during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: integration-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/integration-service/latency_snapshot_to_integration_test_static.md
1 change: 1 addition & 0 deletions rhobs/alerting/data_plane/prometheus.oauth_alerts.yaml
@@ -24,4 +24,5 @@ spec:
description: >-
The average OAuth login time on cluster {{ $labels.source_cluster }} has
{{ $value }} sec for the last 5 minutes
alert_route_namespace: spi-system
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/spi/oauth_login.md
2 changes: 2 additions & 0 deletions rhobs/alerting/data_plane/prometheus.pipeline_alerts.yaml
@@ -25,6 +25,7 @@ spec:
description: >-
Tekton controller on cluster {{ $labels.source_cluster }} the percentage of time needed to receive PipelineRun creation
events vs. overall PipelineRun execution time is at {{ $value | humanizePercentage }} instead of less than 5%.
alert_route_namespace: plnsvc-tests
runbook_url: TBD
- alert: HighExecutionOverhead
expr: |
@@ -42,4 +43,5 @@ spec:
description: >-
Tekton controller on cluster {{ $labels.source_cluster }} the percentage of the time needed to create
underlying TaskRuns vs. overall PipelineRun execution time is at {{ $value | humanizePercentage }} instead of less than 5%.
alert_route_namespace: plnsvc-tests
runbook_url: TBD
@@ -30,4 +30,5 @@ spec:
Time from pipeline run finished to snapshot marked in progress has been over
30s for more than 10% of requests during the last 5 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: perf-team-prometheus-reader
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/integration-service/pipeline_to_snapshot_exceeded.md
1 change: 1 addition & 0 deletions rhobs/alerting/data_plane/prometheus.pv_alerts.yaml
@@ -20,4 +20,5 @@ spec:
description: >-
Persistent Volume {{ $labels.persistentvolume }} in namespace {{ $labels.namespace }} on cluster {{ $labels.source_cluster }}
is in {{ $labels.phase }} phase for more than 5 minutes.
alert_route_namespace: '{{ $labels.namespace }}'
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/o11y/alert-rule-pesistentVolumeIssues.md
1 change: 1 addition & 0 deletions rhobs/alerting/data_plane/prometheus.quota_alerts.yaml
@@ -25,4 +25,5 @@ spec:
Resource {{ $labels.resource }} in namespace {{ $labels.namespace }}
on cluster {{ $labels.source_cluster }} exceeded quota
{{ $labels.resourcequota }}.
alert_route_namespace: '{{ $labels.namespace }}'
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/o11y/alert-rule-QuotaExceeded.md
@@ -45,6 +45,7 @@ spec:
90% of Releases must be processed under one hour
description: >-
Release service is failing to successfully process within the period of one hour for 90% of releases
alert_route_namespace: release-service

- alert: ReleaseServicePreProcessingDurationSeconds
expr: |
@@ -60,6 +61,7 @@ spec:
90% of Releases must start processing under 10 seconds
description: >-
Release service is failing to start processing under 10 seconds for 90% of releases
alert_route_namespace: release-service

- alert: ReleaseServiceValidationDurationSeconds
expr: |
@@ -75,3 +77,4 @@ spec:
90% of Releases must be validated under 5 seconds
description: >-
Release service is failing to run the validations under 5 seconds for 90% of releases
alert_route_namespace: release-service
@@ -30,4 +30,5 @@ spec:
Time from Snapshot Environment Binding created to marked as
ready has been over 120s for more than 10% of requests during
the last 5 minutes on cluster {{ $labels.source_cluster }}
alert_route_namespace: integration-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/integration-service/seb_created_to_ready.md
@@ -28,4 +28,5 @@ spec:
Application controller in Pod {{ $labels.pod }} for namespace
{{ $labels.namespace }} on instance {{ $labels.source_cluster }} having a
{{ $value | humanizePercentage }} of 5xx errors from service provider {{ $labels.sp }} for latest 60 minutes
alert_route_namespace: spi-system
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/spi/alert-rule-serviceprovider5xxErrorsRate.md
@@ -30,4 +30,5 @@ spec:
Time taken to provision image repository has been over
5 minutes for more than 1% of requests during the last 10 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: image-controller
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/image-controller/stability_image_repository_provision.md
@@ -22,4 +22,5 @@ spec:
description: >
Provision image repository failures occured for more than 5 requests during the last 30 minutes
{{ $labels.source_cluster }}
alert_route_namespace: image-controller
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/image-controller/stability_image_repository_provision_failures.md
@@ -30,4 +30,5 @@ spec:
Time taken from PaC provision request till Component is provisioned for PaC builds has been over
5 minutes for more than 1% of requests during the last 10 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: build-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/build-service/stability_pac_provision.md
@@ -30,4 +30,5 @@ spec:
Time taken from PaC unprovision request till Component is unprovisioned for PaC builds has been over
5 minutes for more than 1% of requests during the last 10 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: build-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/build-service/stability_pac_unprovision.md
@@ -30,4 +30,5 @@ spec:
Time taken from simple build request till the build pipeline is submitted has been over
5 minutes for more than 1% of requests during the last 10 minutes on cluster
{{ $labels.source_cluster }}
alert_route_namespace: build-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/build-service/stability_simple_build.md
4 changes: 4 additions & 0 deletions test/promql/tests/data_plane/application_errors_test.yaml
@@ -31,6 +31,7 @@ tests:
Application controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully delete at least 95% of applications over the past hour
alert_route_namespace: 'application-service'
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/application-delete-failed.md

- interval: 1m
@@ -59,6 +60,7 @@ tests:
Application controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully delete at least 95% of applications over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/application-delete-failed.md

- interval: 1m
@@ -115,6 +117,7 @@ tests:
Application controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully create at least 95% of applications over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/application-create-failed.md

- interval: 1m
@@ -143,6 +146,7 @@ tests:
Application controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully create at least 95% of applications over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/application-create-failed.md

- interval: 1m
6 changes: 6 additions & 0 deletions test/promql/tests/data_plane/component_errors_test.yaml
@@ -41,6 +41,7 @@ tests:
Component controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully delete at least 95% of components over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-delete-failed.md

- interval: 1m
@@ -69,6 +70,7 @@ tests:
Component controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully delete at least 95% of components over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-delete-failed.md

- interval: 1m
@@ -98,6 +100,7 @@ tests:
Component controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully delete at least 95% of components over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-delete-failed.md

- interval: 1m
@@ -164,6 +167,7 @@ tests:
Component controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully create at least 95% of components over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-create-failed.md

- interval: 1m
@@ -192,6 +196,7 @@ tests:
Component controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully create at least 95% of components over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-create-failed.md

- interval: 1m
@@ -221,6 +226,7 @@ tests:
Component controller in Pod has for namespace
application-service on cluster cluster01 is failing to
successfully create at least 95% of components over the past hour
alert_route_namespace: application-service
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/has/component-create-failed.md

- interval: 1m
@@ -24,6 +24,7 @@ tests:
Cluster-Agent Operations in non-complete state are too high. Got: 11
description: >-
The sum of cluster-agent operations in non-complete state is greater than 10 on cluster cluster01
alert_route_namespace: "gitops-service-argocd"
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/gitops/cluster-agent-operations.md

- interval: 1m
@@ -45,6 +46,7 @@ tests:
Cluster-Agent Operations in non-complete state are too high. Got: 10
description: >-
The sum of cluster-agent operations in non-complete state is greater than 10 on cluster cluster01
alert_route_namespace: "gitops-service-argocd"
runbook_url: https://gitlab.cee.redhat.com/rhtap/docs/sop/-/blob/main/gitops/cluster-agent-operations.md

- interval: 1m
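
The expected annotations in these tests hold the rendered value of `alert_route_namespace`, so an alert that templates it from the `namespace` label is expected to carry that label's value. A minimal promtool unit-test sketch of this follows. It assumes a hypothetical rule file `example_rules.yaml` in plain Prometheus rule-file format, with the rule described in the comment; the metric name, alert name, and timings are illustrative and not taken from this commit.

```yaml
# Hypothetical promtool unit test. It assumes example_rules.yaml defines:
#   alert: ExampleNamespacedAlert
#   expr: example_namespace_errors_total > 0
#   for: 5m
#   labels: {severity: warning}
#   annotations:
#     alert_route_namespace: '{{ $labels.namespace }}'
rule_files:
  - example_rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute, always above the assumed threshold.
      - series: 'example_namespace_errors_total{namespace="application-service", source_cluster="cluster01"}'
        values: '1+0x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: ExampleNamespacedAlert
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: application-service
              source_cluster: cluster01
            exp_annotations:
              # The template is rendered before comparison, so the test
              # states the resolved namespace rather than the template.
              alert_route_namespace: application-service
```

Such a file would be run with `promtool test rules <test-file>`.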