rambabubolla authored and shreni123 committed Oct 18, 2024
1 parent 4438003 commit 7a630ab
Showing 10 changed files with 113 additions and 255 deletions.
1 change: 0 additions & 1 deletion operations/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -457,7 +457,6 @@ confident that a lack of issues indicates the system is operating normally.
- [Configure Prometheus Email Alert Notifications](system_management_health/Configure_Prometheus_Email_Alert_Notifications.md)
- [Grafana Dashboards by Component](system_management_health/Grafana_Dashboards_by_Component.md)
- [Troubleshoot Grafana Dashboard](system_management_health/Troubleshoot_Grafana_Dashboard.md)
- [Grafterm](system_management_health/Grafterm.md)
- [Remove Kiali](system_management_health/Remove_Kiali.md)
- [`prometheus-kafka-adapter` errors during installation](system_management_health/Prometheus_Kafka_Error.md)
- [`grok-exporter` errors during installation](system_management_health/Grok-Exporter_Error.md)
Original file line number Diff line number Diff line change
Expand Up @@ -565,26 +565,27 @@ Run the following steps from a master node.
1. Restart Prometheus.

```bash
kubectl rollout restart -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
kubectl rollout status -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-0
kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-0
kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-1
kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-1
```

Example output:

```text
Waiting for 1 pods to be ready...
statefulset rolling update complete ...
deployment "vmagent-vms-0" successfully rolled out
```

1. Check for any `tls` errors from the active Prometheus targets. No errors are expected.

```bash
PROM_IP=$(kubectl get services -n sysmgmt-health cray-sysmgmt-health-kube-p-prometheus -o json | jq -r '.spec.clusterIP')
curl -s http://${PROM_IP}:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(."scrapePool" == "sysmgmt-health/cray-sysmgmt-health-kube-p-kube-etcd/0")' | grep lastError | sort -u
PROM_IP=$(kubectl get services -n sysmgmt-health vmagent-vms -o json | jq -r '.spec.clusterIP')
curl -s http://${PROM_IP}:8429/targets | grep kube-etcd | sort -u
```

Example output:

```text
"lastError": "",
state=up, endpoint=https://10.252.1.10:2379/metrics, labels={endpoint="http-metrics",instance="10.252.1.10:2379",job="kube-etcd",namespace="kube-system",service="vms-kube-etcd"}, scrapes_total=28114, scrapes_failed=0, last_scrape=14838ms ago, scrape_duration=14ms, samples_scraped=1487, error=
```
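The check above can be narrowed to unhealthy targets only. The following is a minimal sketch, assuming the `/targets` text format shown in the example output (one target per line, starting with `state=` and ending with an `error=` field). The helper name `unhealthy_targets` and the second sample line are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: the line format is assumed from the example output above.

# unhealthy_targets (hypothetical helper) reads target lines on stdin and
# prints every line that is NOT "state=up" with an empty trailing "error=".
unhealthy_targets() {
    grep -E 'state=' | grep -Ev '^[[:space:]]*state=up,.*error=[[:space:]]*$' || true
}

# Sample data instead of a live system. The second line is invented to
# show what a failing target might look like.
sample_targets() {
    cat <<'EOF'
state=up, endpoint=https://10.252.1.10:2379/metrics, scrapes_failed=0, error=
state=down, endpoint=https://10.252.1.11:2379/metrics, scrapes_failed=3, error=x509: certificate signed by unknown authority
EOF
}

# Prints only the failing sample line.
sample_targets | unhealthy_targets
```

On a live system, the same filter could be appended to the command from this step, for example `curl -s http://${PROM_IP}:8429/targets | grep kube-etcd | unhealthy_targets`. Empty output would then mean that no `kube-etcd` target reported an error.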
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ Use this procedure to resolve any external DNS routing issues with backend servi
services sma-kibana [services-gateway] [sma-kibana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-alertmanager [services/services-gateway] [alertmanager.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-grafana [services/services-gateway] [grafana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-prometheus [services/services-gateway] [prometheus.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-vm-select [services/services-gateway] [vmselect.cmn.SYSTEM_DOMAIN_NAME] 2d16h
```

1. (`ncn-mw#`) Inspect the `VirtualService` objects to learn the destination service and port.
Expand All @@ -47,37 +47,41 @@ Use this procedure to resolve any external DNS routing issues with backend servi
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
creationTimestamp: "2020-07-09T17:49:07Z"
generation: 1
labels:
app: cray-sysmgmt-health-prometheus
app.kubernetes.io/instance: cray-sysmgmt-health
app.kubernetes.io/managed-by: Tiller
app.kubernetes.io/name: cray-sysmgmt-health
app.kubernetes.io/version: 8.15.4
helm.sh/chart: cray-sysmgmt-health-0.3.1
name: cray-sysmgmt-health-prometheus
namespace: sysmgmt-health
resourceVersion: "41620"
selfLink: /apis/networking.istio.io/v1beta1/namespaces/sysmgmt-health/virtualservices/cray-sysmgmt-health-prometheus
uid: d239dfcc-a827-4a51-9b73-6eccfb937088
spec:
gateways:
- services/services-gateway
hosts:
- prometheus.cmn.SYSTEM_DOMAIN_NAME
annotations:
meta.helm.sh/release-name: cray-sysmgmt-health
meta.helm.sh/release-namespace: sysmgmt-health
creationTimestamp: "2024-10-15T12:59:14Z"
generation: 1
labels:
app: cray-sysmgmt-health-vm-select
app.kubernetes.io/instance: cray-sysmgmt-health
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: cray-sysmgmt-health
app.kubernetes.io/version: 0.17.5
helm.sh/chart: cray-sysmgmt-health-1.0.17-20241016103148_b40f1aa
name: cray-sysmgmt-health-vm-select
namespace: sysmgmt-health
resourceVersion: "149049132"
uid: d166065d-1b3b-4434-b25b-e95cb8940b01
spec:
gateways:
- services/services-gateway
- services/customer-admin-gateway
hosts:
- vmselect.cmn.SYSTEM_DOMAIN_NAME
http:
- match:
- authority:
exact: prometheus.cmn.SYSTEM_DOMAIN_NAME
route:
- destination:
host: cray-sysmgmt-health-kube-p-prometheus
port:
number: 9090
- authority:
exact: vmselect.cmn.SYSTEM_DOMAIN_NAME
route:
- destination:
host: vmselect-vms
port:
number: 8481
```

From the `VirtualService data`, it is straightforward to see how traffic will be routed. In this example, connections to `prometheus.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
`cray-sysmgmt-health-prometheus` service in the `sysmgmt-health` namespace on port 9090.
From the `VirtualService` data, it is straightforward to see how traffic will be routed. In this example, connections to `vmselect.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
`vmselect-vms` service in the `sysmgmt-health` namespace on port 8481.

External DNS will now be connected to the backend service.
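Reading the destination out of a long `VirtualService` dump by eye is error-prone. The sketch below pulls out each route destination as `host:port`. It assumes the YAML layout shown in the example (a `host:` key followed by a `number:` key under each `destination`). The helper name `vs_destinations` is hypothetical, and the sample is a trimmed copy of the `spec` section above.

```bash
#!/usr/bin/env bash
# Sketch only: relies on "host:" appearing before "number:" in each destination.

# vs_destinations (hypothetical helper) reads VirtualService YAML on stdin
# and prints each route destination as host:port.
vs_destinations() {
    awk '
        $1 == "host:"   { host = $2 }
        $1 == "number:" { print host ":" $2 }
    '
}

# Trimmed sample of the spec section shown above.
sample_virtualservice() {
    cat <<'EOF'
spec:
  hosts:
  - vmselect.cmn.SYSTEM_DOMAIN_NAME
  http:
  - match:
    - authority:
        exact: vmselect.cmn.SYSTEM_DOMAIN_NAME
    route:
    - destination:
        host: vmselect-vms
        port:
          number: 8481
EOF
}

sample_virtualservice | vs_destinations
```

For the sample, this prints `vmselect-vms:8481`. On a live system, the input could come from `kubectl get virtualservice -n sysmgmt-health cray-sysmgmt-health-vm-select -o yaml`, assuming the object name matches the listing earlier in this procedure.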
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ The Customer Management Network \(CMN\) is not supported on the system.
services sma-kibana [services-gateway] [sma-kibana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-alertmanager [services/services-gateway] [alertmanager.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-grafana [services/services-gateway] [grafana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-prometheus [services/services-gateway] [prometheus.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-vm-select [services/services-gateway] [vmselect.cmn.SYSTEM_DOMAIN_NAME] 2d16h
```

2. Look up the cluster IP and port for the service.
Expand All @@ -48,7 +48,7 @@ The Customer Management Network \(CMN\) is not supported on the system.

```console
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cray-sysmgmt-health-kube-p-prometheus ClusterIP 10.25.124.159 <none> 9090/TCP 23h
cray-sysmgmt-health-grafana ClusterIP 10.25.124.159 <none> 9090/TCP 23h
```

3. Set up port forwarding from a laptop or workstation to access the service.
Expand All @@ -62,3 +62,35 @@ The Customer Management Network \(CMN\) is not supported on the system.
```

4. Visit `http://localhost:9090/` in a laptop or workstation browser.

5. The `vmselect` service is headless, so it has no cluster IP. Use the following steps to access it.

a. Look up the service name and port for the `vmselect` service.

The example below is for the `vmselect-vms` service.

```bash
kubectl -n sysmgmt-health get service vmselect-vms
```

Example output:

```console
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
vmselect-vms ClusterIP None <none> 8481/TCP 14d
```

b. On the NCN, use `kubectl port-forward` to connect to the `vmselect` service running in the Kubernetes cluster.

```bash
kubectl port-forward -n sysmgmt-health service/vmselect-vms 8082:8481
```

c. Set up port forwarding from a laptop or workstation to access the service.

Because the service has no cluster IP, point the tunnel at `localhost` and the local port opened by `kubectl port-forward` in the previous step (`8082` in this example).

Replace the port and system name values in the example below.

```bash
ssh -L 9090:localhost:8082 root@SYSTEM_NCN_DOMAIN_NAME
```

d. Visit `http://localhost:9090/` in a laptop or workstation browser.
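Whether a service needs the headless procedure can be read from the `CLUSTER-IP` column, which shows `None` for headless services. The following is a minimal sketch, assuming the default `kubectl get service` table layout shown above. The helper name `headless_services` is hypothetical, and the sample rows are copied from the example outputs in this procedure.

```bash
#!/usr/bin/env bash
# Sketch only: assumes the default column order NAME, TYPE, CLUSTER-IP, ...

# headless_services (hypothetical helper) reads a "kubectl get service"
# table on stdin and prints the names of services whose CLUSTER-IP is None.
headless_services() {
    awk 'NR > 1 && $3 == "None" { print $1 }'
}

# Sample rows taken from the example outputs above.
sample_services() {
    cat <<'EOF'
NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
cray-sysmgmt-health-grafana   ClusterIP   10.25.124.159   <none>        9090/TCP   23h
vmselect-vms                  ClusterIP   None            <none>        8481/TCP   14d
EOF
}

sample_services | headless_services
```

For the sample, this prints `vmselect-vms`. On a live system, the input could come from `kubectl -n sysmgmt-health get service`.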
Original file line number Diff line number Diff line change
Expand Up @@ -44,34 +44,33 @@ When accessing the URLs listed below, it will be necessary to accept one or more
logging in. The details of the security warning will indicate that a self-signed certificate/unknown issuer is being used for the site. Support for incorporation of certificates from Trusted Certificate
Authorities is planned for a future release.

### Prometheus
### VictoriaMetrics UI

URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`
URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmui`

Central Prometheus instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of `kube-prometheus-stack` Helm chart).
The `vmagent` instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of the `victoria-metrics-k8s-stack` Helm chart).

Prometheus generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
VictoriaMetrics generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
Clicking on any of the alerts will expand them, enabling users to use the 'Labels' data to discern the details of the alert. The details will also show the state of the alert, how long it has been
active, and the value for the alert.

For more information regarding the use of the Prometheus interface, see
[Getting Started/](https://prometheus.io/docs/prometheus/latest/getting_started/) in the Prometheus online documentation.
For more information regarding the use of the VictoriaMetrics interface, see
the [VictoriaMetrics documentation](https://docs.victoriametrics.com/).

Some alerts may be falsely triggered. This occurs if they are alerts which will be improved in the future, or if they are alerts impacted by whether all software products have been installed yet.
See [Troubleshoot Prometheus Alerts](Troubleshoot_Prometheus_Alerts.md).

### Thanos
### VMAlert

URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmalert/`

VMAlert executes a list of given alerting or recording rules against a configured address.

URL: `https://thanos.cmn.SYSTEM_DOMAIN_NAME/`
The VMAlert CRD declaratively defines a desired VMAlert setup to run in a Kubernetes cluster.

Thanos is a set of components that can be composed into a highly available, multi Prometheus metric system with potentially unlimited storage capacity, if your Object Storage allows for it.
It leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies.
Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus HA pairs.
It has few required configuration options: only `datasource` and `notifier` are required. For other configuration parameters, see the VMAlert documentation.

For more information regarding the use of the Thanos interface, see
[Getting Started/](https://thanos.io/tip/thanos/getting-started.md/) in the thanos online documentation.
For each VMAlert resource, the Operator deploys a properly configured Deployment in the same namespace. The VMAlert Pods are configured to mount a list of ConfigMaps prefixed with `<VMAlert-name>-number` containing the configuration for alerting rules.

### Alertmanager

URL: `https://alertmanager.cmn.SYSTEM_DOMAIN_NAME/`


This file was deleted.

Original file line number Diff line number Diff line change
Expand Up @@ -62,10 +62,10 @@ This procedure can be performed on any master or worker NCN.

```yaml
global:
resolve_timeout: 5m
resolve_timeout: 5h
route:
group_by:
- job
- group
group_interval: 5m
group_wait: 30s
receiver: "null"
Expand Down Expand Up @@ -93,11 +93,13 @@ This procedure can be performed on any master or worker NCN.
- to: [email protected]
from: [email protected]
# Your smtp server address
require_tls: false
smarthost: smtp.gmail.com:587
auth_username: [email protected]
auth_identity: [email protected]
auth_password: xxxxxxxxxxxxxxxx
```

NOTE: Set `require_tls: false` for each receiver if TLS needs to be disabled.

1. (`ncn-mw#`) Replace the alert notification configuration based on the files created in the previous steps.
