CASMMON-446

Cray-HPE · Oct 18, 2024 · 4438003
1 parent 3c6473c
commit 4438003
Showing 18 changed files with 191 additions and 286 deletions.
diff --git a/.spelling b/.spelling
@@ -29,7 +29,12 @@ DaemonSet
 DaemonSets
 StatefulSet
 StatefulSets
-
+VictoriaMetrics
+Vmagent
+Vmalert
+Vminsert
+Vmselect
+Vmstorage
 
 1.0.x
 1.2.x
@@ -1071,9 +1076,6 @@ Init
 # To allow System Admin Toolkit
 Admin
 
-- operations/system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md
-Alerta
-
 - operations/system_management_health/thanos.md
 gRPC
 spec

diff --git a/operations/CSM_product_management/Post_Install_Customizations.md b/operations/CSM_product_management/Post_Install_Customizations.md
@@ -57,13 +57,12 @@ Check to see if there are any recent out of memory events.
 
 1. Search for the "Kubernetes / Compute Resources / Pod" dashboard to view the memory utilization graphs over time for any pod that has been `OOMKilled`.
 
-## Prometheus `CPUThrottlingHigh` alerts
+## vmstorage `CPUThrottlingHigh` alerts
 
 Check Prometheus for recent `CPUThrottlingHigh` alerts.
 
-1. Log in to Prometheus at the following URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`
+1. Log in to vmalert at the following URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmalert/api/v1/alerts`
 
- 1. Select the **Alert** tab.
 
  1. Scroll down to the alert for `CPUThrottlingHigh`.
 
@@ -86,7 +85,7 @@ Use Grafana to investigate and analyze CPU throttling and memory usage.
  ```yaml
  datasource: default
  namespace: sysmgmt-health
- pod: prometheus-cray-sysmgmt-health-kube-p-prometheus-0
+ pod: vmstorage-vms-0
  ```
 
 ### CPU throttling
@@ -122,12 +121,12 @@ Use Grafana to investigate and analyze CPU throttling and memory usage.
 ## Common customization scenarios
 
 * [Prerequisites](#prerequisites)
-* [Prometheus pod is `OOMKilled` or CPU throttled](#prometheus-pod-is-oomkilled-or-cpu-throttled)
+* [vmstorage pod is `OOMKilled` or CPU throttled](#vmstorage-pod-is-oomkilled-or-cpu-throttled)
 * [Postgres pods are `OOMKilled` or CPU throttled](#postgres-pods-are-oomkilled-or-cpu-throttled)
 * [Scale `cray-bss` service](#scale-cray-bss-service)
 * [Scale `cray-dns-unbound` service](#scale-cray-dns-unbound-service)
 * [Postgres PVC resize](#postgres-pvc-resize)
-* [Prometheus PVC resize](#prometheus-pvc-resize)
+* [vmstorage PVC resize](#vmstorage-pvc-resize)
 * [`cray-hms-hmcollector` pods are `OOMKilled`](#cray-hms-hmcollector-pods-are-oomkilled)
 * [`cray-cfs-api` pods are `OOMKilled`](#cray-cfs-api-pods-are-oomkilled)
 * [References](#references)
@@ -139,7 +138,7 @@ procedure for a specific chart. In these cases, the section on this page provide
 information necessary in order to carry out that procedure. It is recommended to keep both pages open
 in different browser windows for easy reference.
 
-### Prometheus pod is `OOMKilled` or CPU throttled
+### Vmstorage pod is `OOMKilled` or CPU throttled
 
 Update resources associated with Prometheus in the `sysmgmt-health` namespace.
 This example is based on what was needed for a system with 4000 compute nodes.
@@ -153,41 +152,41 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo
 
  **Only follow these steps as part of the previously linked chart redeploy procedure.**
 
- 1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources`.
+ 1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources`.
 
  * If the number of NCNs is less than 20, then:
 
  ```bash
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '2'
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '15Gi'
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '6'
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '30Gi'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.cpu' --style=double '4'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.memory' '8Gi'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.cpu' --style=double '8'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.memory' '16Gi'
  ```
 
  * If the number of NCNs is 20 or more, then:
 
  ```bash
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '6'
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '50Gi'
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '12'
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '60Gi'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.cpu' --style=double '6'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.memory' '16Gi'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.cpu' --style=double '12'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.memory' '32Gi'
  ```
 
  1. Check that the customization file has been updated.
 
  ```bash
- yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources'
+ yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources'
  ```
 
  Example output:
 
  ```yaml
  requests:
- cpu: "3"
- memory: 15Gi
+ cpu: "4"
+ memory: 8Gi
  limits:
- cpu: "6"
- memory: 30Gi
+ cpu: "8"
+ memory: 16Gi
  ```
 
 * (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:
@@ -196,23 +195,24 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo
 
  1. Verify that the pod restarts and that the desired resources have been applied.
 
- Watch the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod restart.
+ Watch the `vmstorage-vms` pods restart.
 
  ```bash
- watch "kubectl get pods -n sysmgmt-health -l prometheus=cray-sysmgmt-health-kube-p-prometheus"
+ watch "kubectl get pods -n sysmgmt-health -l app.kubernetes.io/name=vmstorage"
  ```
 
- It may take about 10 minutes for the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod to terminate.
+ It may take about 10 minutes for the `vmstorage-vms-*` pods to terminate.
  It can be forced deleted if it remains in the terminating state:
 
  ```bash
- kubectl delete pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 --force --grace-period=0 -n sysmgmt-health
+ kubectl delete pod vmstorage-vms-0 --force --grace-period=0 -n sysmgmt-health
+ kubectl delete pod vmstorage-vms-1 --force --grace-period=0 -n sysmgmt-health
  ```
 
  1. Verify that the resource changes are in place.
 
  ```bash
- kubectl get pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "prometheus").resources'
+ kubectl get pod vmstorage-vms-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "vmstorage").resources'
  ```
 
 * **Make sure to perform the entire linked procedure, including the step to save the updated customizations.**
@@ -525,9 +525,9 @@ Using the values from the above table, follow the [Redeploying a Chart](Redeploy
 
 * **Make sure to perform the entire linked procedure, including the step to save the updated customizations.**
 
-### Prometheus PVC resize
+### Vmstorage PVC resize
 
-Increase the PVC volume size associated with `prometheus-cray-sysmgmt-health-kube-p-prometheus` cluster in the `sysmgmt-health` namespace.
+Increase the PVC volume size associated with the `vmstorage` cluster in the `sysmgmt-health` namespace.
 This example is based on what was needed for a system with more than 20 non compute nodes (NCNs). The PVC size can only ever be increased.
 
 Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the following specifications**:
@@ -538,16 +538,16 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo
 
  **Only follow these steps as part of the previously linked chart redeploy procedure.**
 
- 1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage`.
+ 1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage`.
 
  ```bash
- yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
+ yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
  ```
 
  1. Check that the customization file has been updated.
 
  ```bash
- yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage'
+ yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage'
  ```
 
  Example output:
@@ -563,14 +563,14 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo
  Verify that the increased volume size has been applied.
 
  ```bash
- watch "kubectl get pvc -n sysmgmt-health prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0"
+ watch "kubectl get pvc -n sysmgmt-health vmstorage-db-vmstorage-vms-0"
  ```
 
  Example output:
 
  ```text
- NAME  STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
- prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0 Bound pvc-bcb8f4f1-fb84-4b48-95c7-63508ef18962 200Gi RWO k8s-block-replicated 3d2h
+ NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
+ vmstorage-db-vmstorage-vms-0  Bound pvc-092805e3-ac92-438e-b77b-a0639096f5f0 200Gi RWO k8s-block-replicated 3d2h
  ```
 
  At this point the Prometheus cluster is healthy, but additional steps are required to complete the resize of the Prometheus PVCs.
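The vmstorage sizing in the `Post_Install_Customizations.md` hunks above depends only on whether the system has fewer than 20 NCNs. A minimal sketch of that selection follows, assuming the values shown in the diff; `vmstorage_sizing` and `print_yq_commands` are hypothetical helper names that are not part of this commit, and the commands are printed for review rather than executed:

```bash
# Hypothetical helpers (not part of this commit). Sizing values are copied
# from the yq commands in the diff above.
BASE='spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources'

# Print "requests.cpu requests.memory limits.cpu limits.memory" for an NCN count.
vmstorage_sizing() {
    if [ "$1" -lt 20 ]; then
        echo "4 8Gi 8 16Gi"
    else
        echo "6 16Gi 12 32Gi"
    fi
}

# Print (do not run) the four yq commands for the given NCN count.
print_yq_commands() {
    # Unquoted on purpose: split the four values into $1..$4.
    set -- $(vmstorage_sizing "$1")
    printf "yq write -i customizations.yaml '%s.requests.cpu' --style=double '%s'\n" "$BASE" "$1"
    printf "yq write -i customizations.yaml '%s.requests.memory' '%s'\n" "$BASE" "$2"
    printf "yq write -i customizations.yaml '%s.limits.cpu' --style=double '%s'\n" "$BASE" "$3"
    printf "yq write -i customizations.yaml '%s.limits.memory' '%s'\n" "$BASE" "$4"
}

# Example: a system with 25 NCNs falls in the "20 or more" branch.
print_yq_commands 25
```

Printing the commands instead of running them keeps the sketch safe to try outside of the chart redeploy procedure.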

diff --git a/operations/README.md b/operations/README.md
@@ -454,7 +454,6 @@ confident that a lack of issues indicates the system is operating normally.
 - [System Management Health](system_management_health/System_Management_Health.md)
 - [System Management Health Checks and Alerts](system_management_health/System_Management_Health_Checks_and_Alerts.md)
 - [Access System Management Health Services](system_management_health/Access_System_Management_Health_Services.md)
-- [Configure Prometheus Alerta Alert Notifications](system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md)
 - [Configure Prometheus Email Alert Notifications](system_management_health/Configure_Prometheus_Email_Alert_Notifications.md)
 - [Grafana Dashboards by Component](system_management_health/Grafana_Dashboards_by_Component.md)
  - [Troubleshoot Grafana Dashboard](system_management_health/Troubleshoot_Grafana_Dashboard.md)

diff --git a/operations/kubernetes/Determine_if_Pods_are_Hitting_Resource_Limits.md b/operations/kubernetes/Determine_if_Pods_are_Hitting_Resource_Limits.md
@@ -80,17 +80,17 @@ Identify pods that are hitting resource limits in order to increase the resource
  Example output:
 
  ```text
- default 54m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1223856 (prometheus) score 1966 or sacrifice child
- default 44m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1372634 (prometheus) score 1966 or sacrifice child
+ default 54m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1223856 (vmstorage) score 1966 or sacrifice child
+ default 44m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1372634 (vmstorage) score 1966 or sacrifice child
  ```
 
  1. (`ncn-mw#`) Determine which pod was killed using the output of the previous command.
 
  Search the pods in Kubernetes for the string returned in the previous step to find the exact pod name.
- Based on the previous example command output, `prometheus` is used in this example:
+ Based on the previous example command output, `vmstorage` is used in this example:
 
  ```bash
- kubectl get pod -A | grep prometheus
+ kubectl get pod -A | grep vmstorage
  ```
 
 1. Increase the resource limits for the pods identified in this procedure.
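The lookup in the hunk above takes the process name from the `OOMKilling` event and searches the pod list for it. A minimal sketch of that extraction follows, assuming the event line format shown in the example output; `oom_process_name` is a hypothetical helper and is not part of this commit:

```bash
# Hypothetical helper: read OOMKilling event lines on stdin and print the
# process name found in parentheses after "Kill process <pid>".
oom_process_name() {
    sed -n 's/.*Kill process [0-9][0-9]* (\([^)]*\)).*/\1/p'
}

# Example event line copied from the example output above.
event='default 54m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1223856 (vmstorage) score 1966 or sacrifice child'
name=$(printf '%s\n' "$event" | oom_process_name)
echo "$name"

# On a live system the name would then be used as in the procedure:
#   kubectl get pod -A | grep "$name"
```

Lines that do not match the expected format produce no output, so unrelated events are skipped.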

diff --git a/operations/network/customer_accessible_networks/Customer_Accessible_Networks.md b/operations/network/customer_accessible_networks/Customer_Accessible_Networks.md
@@ -8,7 +8,7 @@ The Customer Management Network \(CMN\) provides access from outside the custome
 
 - Administrator clients outside of the system:
  - Log in to NCNs.
- - Access administrative web UIs within the system \(e.g. Prometheus, Grafana, and more\).
+ - Access administrative web UIs within the system \(e.g. Vmselect, Grafana, and more\).
  - Access the administrative REST APIs.
  - Access a DNS server within the system for resolution of names for the webUI and REST API services.
  - Run administrative Cray CLI commands from outside the system.

diff --git a/operations/network/customer_accessible_networks/Externally_Exposed_Services.md b/operations/network/customer_accessible_networks/Externally_Exposed_Services.md
@@ -24,7 +24,7 @@ See [External DNS](../external_dns/External_DNS.md) for more information.
 | OAuth2 Proxy Ingress - CMN |   | customer-management | Yes | 443 |   |
 | OAuth2 Proxy Ingress - CAN |   | customer-access | Yes | 443 |   |
 | OAuth2 Proxy Ingress - CHN |   | customer-high-speed | Yes | 443 |   |
-| System Management Health Prometheus | `prometheus` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
+| System Management Health Vmselect | `vmselect` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
 | System Management Health Alert Manager | `alertmanager` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
 | System Management Health Grafana | `grafana` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
 | Istio Kiali | `kiali-istio` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |

diff --git a/operations/network/dns/PowerDNS_Configuration.md b/operations/network/dns/PowerDNS_Configuration.md
@@ -31,13 +31,13 @@ cray-dns-powerdns-nmn-udp LoadBalancer 10.17.203.241 10.92.100.85 53:318
 
 A system administrator would typically setup the subdomain `system.dev.cray.com` in their site DNS and create a record which points to the IP address `10.101.8.113`, for example `ins1.system.dev.cray.com`.
 
-The administrator would then delegate queries to `system.dev.cray.com` to `ins1.system.dev.cray.com` making it authoritative for that subdomain allowing CSM to respond to queries for services like `prometheus.system.dev.cray.com`
+The administrator would then delegate queries to `system.dev.cray.com` to `ins1.system.dev.cray.com` making it authoritative for that subdomain allowing CSM to respond to queries for services like `grafana.system.dev.cray.com`
 
 The specifics of how to configure to configuring DNS forwarding is dependent on the DNS server in use, please consult the documentation provided by the DNS server vendor for more information.
 
 ## Authoritative Zone Transfer
 
-In addition to responding to external DNS queries, PowerDNS can support replication of domain information to secondary servers via AXFR (Authoritative Zone Transfer) queries.
+In addition to responding to external DNS queries, PowerDNS can support replication of domain information to secondary servers via AXFR (Authoritative Zone Transfer) queries.
 
 ### Configuration parameters
 

diff --git a/operations/network/dns/PowerDNS_migration.md b/operations/network/dns/PowerDNS_migration.md
@@ -59,7 +59,7 @@ The following table of examples assumes that the system was configured with a `s
 | auth.shasta.dev.cray.com | auth.cmn.shasta.dev.cray.com | | No |
 | nexus.shasta.dev.cray.com | nexus.cmn.shasta.dev.cray.com | | No |
 | grafana.shasta.dev.cray.com | grafana.cmn.shasta.dev.cray.com | | No |
-| prometheus.shasta.dev.cray.com | prometheus.cmn.shasta.dev.cray.com | | No |
+| vmselect.shasta.dev.cray.com | vmselect.cmn.shasta.dev.cray.com | | No |
 | alertmanager.shasta.dev.cray.com | alertmanager.cmn.shasta.dev.cray.com | | No |
 | vcs.shasta.dev.cray.com | vcs.cmn.shasta.dev.cray.com | | No |
 | kiali-istio.shasta.dev.cray.com | kiali-istio.cmn.shasta.dev.cray.com | | No |