CASMMON-446

Cray-HPE · Oct 18, 2024 · 7a630ab
1 parent 4438003
Showing 10 changed files with 113 additions and 255 deletions.
diff --git a/operations/README.md b/operations/README.md
@@ -457,7 +457,6 @@ confident that a lack of issues indicates the system is operating normally.
 - [Configure Prometheus Email Alert Notifications](system_management_health/Configure_Prometheus_Email_Alert_Notifications.md)
 - [Grafana Dashboards by Component](system_management_health/Grafana_Dashboards_by_Component.md)
  - [Troubleshoot Grafana Dashboard](system_management_health/Troubleshoot_Grafana_Dashboard.md)
-- [Grafterm](system_management_health/Grafterm.md)
 - [Remove Kiali](system_management_health/Remove_Kiali.md)
 - [`prometheus-kafka-adapter` errors during installation](system_management_health/Prometheus_Kafka_Error.md)
 - [`grok-exporter` errors during installation](system_management_health/Grok-Exporter_Error.md)

diff --git a/operations/kubernetes/Cert_Renewal_for_Kubernetes_and_Bare_Metal_EtcD.md b/operations/kubernetes/Cert_Renewal_for_Kubernetes_and_Bare_Metal_EtcD.md
@@ -565,26 +565,27 @@ Run the following steps from a master node.
  1. Restart Prometheus.
 
  ```bash
- kubectl rollout restart -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
- kubectl rollout status -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
+ kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-0
+ kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-0
+ kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-1
+ kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-1
  ```
 
  Example output:
 
  ```text
- Waiting for 1 pods to be ready...
- statefulset rolling update complete ...
+ deployment "vmagent-vms-0" successfully rolled out
  ```
 
  1. Check for any `tls` errors from the active Prometheus targets. No errors are expected.
 
  ```bash
- PROM_IP=$(kubectl get services -n sysmgmt-health cray-sysmgmt-health-kube-p-prometheus -o json | jq -r '.spec.clusterIP')
- curl -s http://${PROM_IP}:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(."scrapePool" == "sysmgmt-health/cray-sysmgmt-health-kube-p-kube-etcd/0")' | grep lastError | sort -u
+ PROM_IP=$(kubectl get services -n sysmgmt-health vmagent-vms -o json | jq -r '.spec.clusterIP')
+ curl -s http://${PROM_IP}:8429/targets | grep kube-etcd | sort -u
  ```
 
  Example output:
 
  ```text
- "lastError": "",
+ state=up, endpoint=https://10.252.1.10:2379/metrics, labels={endpoint="http-metrics",instance="10.252.1.10:2379",job="kube-etcd",namespace="kube-system",service="vms-kube-etcd"}, scrapes_total=28114, scrapes_failed=0, last_scrape=14838ms ago, scrape_duration=14ms, samples_scraped=1487, error=
  ```
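The check above relies on reading the `grep` output by eye. A minimal sketch of a filter that prints only the problem targets follows; it assumes the text format shown in the example output (a `state=` field and a trailing `error=` field on each line), and `failed_targets` is a hypothetical helper name, not part of CSM or VictoriaMetrics.

```bash
#!/usr/bin/env bash
# failed_targets (hypothetical helper): read the vmagent /targets text on
# stdin and print the lines for one job (default: kube-etcd) that are not
# healthy. A line counts as healthy when it reports state=up and its
# trailing error= field is empty, as in the example output above.
failed_targets() {
    local job="${1:-kube-etcd}"
    awk -v job="job=\"${job}\"" '
        index($0, job) == 0 { next }                  # line belongs to another job
        /state=up,/ && /error=[ \t\r]*$/ { next }     # healthy target, skip it
        { print }                                     # anything else needs attention
    '
}

# Intended use on a master node, with PROM_IP set as in the step above:
#   curl -s "http://${PROM_IP}:8429/targets" | failed_targets kube-etcd
```

No output means that every `kube-etcd` target is up and reports an empty `error=` field. The cumulative `scrapes_failed` counter is deliberately ignored, because it can be non-zero on a target that has since recovered.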
diff --git a/...ns/network/external_dns/External_DNS_Failing_to_Discover_Services_Workaround.md b/...ns/network/external_dns/External_DNS_Failing_to_Discover_Services_Workaround.md
@@ -30,7 +30,7 @@ Use this procedure to resolve any external DNS routing issues with backend servi
  services sma-kibana [services-gateway] [sma-kibana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
  sysmgmt-health cray-sysmgmt-health-alertmanager [services/services-gateway] [alertmanager.cmn.SYSTEM_DOMAIN_NAME] 2d16h
  sysmgmt-health cray-sysmgmt-health-grafana [services/services-gateway] [grafana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
- sysmgmt-health cray-sysmgmt-health-prometheus [services/services-gateway] [prometheus.cmn.SYSTEM_DOMAIN_NAME] 2d16h
+ sysmgmt-health cray-sysmgmt-health-vm-select  [services/services-gateway] [vmselect.cmn.SYSTEM_DOMAIN_NAME]  2d16h
  ```
 
 1. (`ncn-mw#`) Inspect the `VirtualService` objects to learn the destination service and port.
@@ -47,37 +47,41 @@ Use this procedure to resolve any external DNS routing issues with backend servi
  apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata:
- creationTimestamp: "2020-07-09T17:49:07Z"
- generation: 1
- labels:
- app: cray-sysmgmt-health-prometheus
- app.kubernetes.io/instance: cray-sysmgmt-health
- app.kubernetes.io/managed-by: Tiller
- app.kubernetes.io/name: cray-sysmgmt-health
- app.kubernetes.io/version: 8.15.4
- helm.sh/chart: cray-sysmgmt-health-0.3.1
- name: cray-sysmgmt-health-prometheus
- namespace: sysmgmt-health
- resourceVersion: "41620"
- selfLink: /apis/networking.istio.io/v1beta1/namespaces/sysmgmt-health/virtualservices/cray-sysmgmt-health-prometheus
- uid: d239dfcc-a827-4a51-9b73-6eccfb937088
- spec:
- gateways:
- - services/services-gateway
- hosts:
- - prometheus.cmn.SYSTEM_DOMAIN_NAME
+ annotations:
+ meta.helm.sh/release-name: cray-sysmgmt-health
+ meta.helm.sh/release-namespace: sysmgmt-health
+ creationTimestamp: "2024-10-15T12:59:14Z"
+ generation: 1
+ labels:
+ app: cray-sysmgmt-health-vm-select
+ app.kubernetes.io/instance: cray-sysmgmt-health
+ app.kubernetes.io/managed-by: Helm
+ app.kubernetes.io/name: cray-sysmgmt-health
+ app.kubernetes.io/version: 0.17.5
+ helm.sh/chart: cray-sysmgmt-health-1.0.17-20241016103148_b40f1aa
+ name: cray-sysmgmt-health-vm-select
+ namespace: sysmgmt-health
+ resourceVersion: "149049132"
+ uid: d166065d-1b3b-4434-b25b-e95cb8940b01
+ spec:
+ gateways:
+ - services/services-gateway
+ - services/customer-admin-gateway
+ hosts:
+ - vmselect.cmn.SYSTEM_DOMAIN_NAME
  http:
  - match:
- - authority:
- exact: prometheus.cmn.SYSTEM_DOMAIN_NAME
- route:
- - destination:
- host: cray-sysmgmt-health-kube-p-prometheus
- port:
- number: 9090
+ - authority:
+ exact: vmselect.cmn.SYSTEM_DOMAIN_NAME
+ route:
+ - destination:
+ host: vmselect-vms
+ port:
+ number: 8481
+
  ```
 
- From the `VirtualService data`, it is straightforward to see how traffic will be routed. In this example, connections to `prometheus.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
- `cray-sysmgmt-health-prometheus` service in the `sysmgmt-health` namespace on port 9090.
+ From the `VirtualService` data, it is straightforward to see how traffic will be routed. In this example, connections to `vmselect.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
+ `vmselect-vms` service in the `sysmgmt-health` namespace on port 8481.
 
 External DNS will now be connected to the backend service.
diff --git a/...external_dns/Troubleshoot_Systems_Not_Provisioned_with_External_IP_Addresses.md b/...external_dns/Troubleshoot_Systems_Not_Provisioned_with_External_IP_Addresses.md
@@ -33,7 +33,7 @@ The Customer Management Network \(CMN\) is not supported on the system.
  services sma-kibana [services-gateway] [sma-kibana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
  sysmgmt-health cray-sysmgmt-health-alertmanager [services/services-gateway] [alertmanager.cmn.SYSTEM_DOMAIN_NAME] 2d16h
  sysmgmt-health cray-sysmgmt-health-grafana [services/services-gateway] [grafana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
- sysmgmt-health cray-sysmgmt-health-prometheus [services/services-gateway] [prometheus.cmn.SYSTEM_DOMAIN_NAME] 2d16h
+ sysmgmt-health cray-sysmgmt-health-vm-select [services/services-gateway] [vmselect.cmn.SYSTEM_DOMAIN_NAME] 2d16h
  ```
 
 2. Lookup the cluster IP and port for service.
@@ -48,7 +48,7 @@ The Customer Management Network \(CMN\) is not supported on the system.
 
  ```console
  NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
- cray-sysmgmt-health-kube-p-prometheus ClusterIP 10.25.124.159 <none> 9090/TCP 23h
+ cray-sysmgmt-health-grafana ClusterIP 10.25.124.159 <none> 9090/TCP 23h
  ```
 
 3. Setup port forwarding from a laptop or workstation to access the service.
@@ -62,3 +62,35 @@ The Customer Management Network \(CMN\) is not supported on the system.
  ```
 
 4. Visit `http://localhost:9090/` in a laptop or workstation browser.
+
+5. The `vmselect` service is headless, so it has no cluster IP address. Use the following steps to access it.
+
+ a) Look up the service name and port for the `vmselect` service.
+
+ The example below is for the `vmselect-vms` service.
+
+ ```bash
+ kubectl -n sysmgmt-health get service vmselect-vms
+ ```
+
+ Example output:
+
+ ```console
+ NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
+ vmselect-vms ClusterIP None <none> 8481/TCP 14d
+ ```
+ Use `kubectl port-forward` to connect to the `vmselect` service running in the Kubernetes cluster.
+
+ ```bash
+ kubectl port-forward -n sysmgmt-health service/vmselect-vms 8082:8481
+ ```
+
+ Set up port forwarding from a laptop or workstation to the NCN where `kubectl port-forward` is running.
+
+ Because the service has no cluster IP address, forward to the local port that `kubectl port-forward` opened in the previous step (8082 in this example). Any unprivileged port can be used on the local side; this example uses 9090.
+
+ Replace the port and system name values in the example below.
+
+ ```bash
+ ssh -L 9090:localhost:8082 root@SYSTEM_NCN_DOMAIN_NAME
+ ```
+
+ b) Visit `http://localhost:9090/` in a laptop or workstation browser.
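Whether a service needs the headless procedure can be read from the `CLUSTER-IP` column of `kubectl get service`. A minimal sketch follows, assuming the default table layout shown in the example output (`NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE`); `headless_services` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# headless_services (hypothetical helper): read `kubectl get service` table
# output on stdin and print "NAME PORT(S)" for every service whose
# CLUSTER-IP column is "None", which marks a headless service.
headless_services() {
    awk '$3 == "None" { print $1, $5 }'
}

# Intended use on an NCN:
#   kubectl -n sysmgmt-health get service | headless_services
# For the vmselect-vms row in the example output this prints:
#   vmselect-vms 8481/TCP
```

Services listed by this filter have to be reached with `kubectl port-forward`, while services with a cluster IP address can use the direct SSH tunnel from the earlier steps.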
diff --git a/operations/system_management_health/Access_System_Management_Health_Services.md b/operations/system_management_health/Access_System_Management_Health_Services.md
@@ -44,34 +44,33 @@ When accessing the URLs listed below, it will be necessary to accept one or more
 logging in. The details of the security warning will indicate that a self-signed certificate/unknown issuer is being used for the site. Support for incorporation of certificates from Trusted Certificate
 Authorities is planned for a future release.
 
-### Prometheus
+### VictoriaMetrics UI
 
-URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`
+URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmui`
 
-Central Prometheus instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of `kube-prometheus-stack` Helm chart).
+The `vmagent` instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of the `victoria-metrics-k8s-stack` Helm chart).
 
-Prometheus generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
+VictoriaMetrics generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
 Clicking on any of the alerts will expand them, enabling users to use the 'Labels' data to discern the details of the alert. The details will also show the state of the alert, how long it has been
 active, and the value for the alert.
 
-For more information regarding the use of the Prometheus interface, see
-[Getting Started/](https://prometheus.io/docs/prometheus/latest/getting_started/) in the Prometheus online documentation.
+For more information regarding the use of the VictoriaMetrics interface, see
+[Getting Started/](https://docs.victoriametrics.com/) in the VictoriaMetrics online documentation.
 
 Some alerts may be falsely triggered. This occurs if they are alerts which will be improved in the future, or if they are alerts impacted by whether all software products have been installed yet.
 See [Troubleshoot Prometheus Alerts](Troubleshoot_Prometheus_Alerts.md).
 
-### Thanos
+### VMAlert
+
+URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmalert/`
+
+VMAlert executes a list of the given alerting or recording rules against the configured address.
 
-URL: `https://thanos.cmn.SYSTEM_DOMAIN_NAME/`
+The VMAlert CRD declaratively defines a desired VMAlert setup to run in a Kubernetes cluster.
 
-Thanos is a set of components that can be composed into a highly available, multi Prometheus metric system with potentially unlimited storage capacity, if your Object Storage allows for it.
-It leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies.
-Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus HA pairs.
+Only the `datasource` and `notifier` configuration options are required; for the other configuration parameters, see the VMAlert documentation.
 
-For more information regarding the use of the Thanos interface, see
-[Getting Started/](https://thanos.io/tip/thanos/getting-started.md/) in the thanos online documentation.
+For each VMAlert resource, the Operator deploys a properly configured Deployment in the same namespace. The VMAlert Pods are configured to mount a list of ConfigMaps prefixed with `<VMAlert-name>-number` containing the configuration for alerting rules.
 
-### Alertmanager
 
 URL: `https://alertmanager.cmn.SYSTEM_DOMAIN_NAME/`
 

diff --git a/...ons/system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md b/...ons/system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md
diff --git a/...ions/system_management_health/Configure_Prometheus_Email_Alert_Notifications.md b/...ions/system_management_health/Configure_Prometheus_Email_Alert_Notifications.md
@@ -62,10 +62,10 @@ This procedure can be performed on any master or worker NCN.
 
  ```yaml
  global:
- resolve_timeout: 5m
+ resolve_timeout: 5h
  route:
  group_by:
- - job
+ - group
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
@@ -93,11 +93,13 @@ This procedure can be performed on any master or worker NCN.
  - to: [email protected]
  from: [email protected]
  # Your smtp server address
  smarthost: smtp.gmail.com:587
+ require_tls: false
  auth_username: [email protected]
  auth_identity: [email protected]
  auth_password: xxxxxxxxxxxxxxxx
  ```
+NOTE: Set `require_tls: false` for each receiver if TLS needs to be disabled.
 
 1. (`ncn-mw#`) Replace the alert notification configuration based on the files created in the previous steps.