rambabubolla authored and shreni123 committed Oct 18, 2024
1 parent 3c6473c commit 4438003
Showing 18 changed files with 191 additions and 286 deletions.
10 changes: 6 additions & 4 deletions .spelling
@@ -29,7 +29,12 @@ DaemonSet
DaemonSets
StatefulSet
StatefulSets

VictoriaMetrics
Vmagent
Vmalert
Vminsert
Vmselect
Vmstorage

1.0.x
1.2.x
@@ -1071,9 +1076,6 @@ Init
# To allow System Admin Toolkit
Admin

- operations/system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md
Alerta

- operations/system_management_health/thanos.md
gRPC
spec
68 changes: 34 additions & 34 deletions operations/CSM_product_management/Post_Install_Customizations.md
@@ -57,13 +57,12 @@ Check to see if there are any recent out of memory events.

1. Search for the "Kubernetes / Compute Resources / Pod" dashboard to view the memory utilization graphs over time for any pod that has been `OOMKilled`.

## Prometheus `CPUThrottlingHigh` alerts
## vmstorage `CPUThrottlingHigh` alerts

Check Prometheus for recent `CPUThrottlingHigh` alerts.

1. Log in to Prometheus at the following URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`
1. Log in to vmalert at the following URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmalert/api/v1/alerts`

1. Select the **Alert** tab.

1. Scroll down to the alert for `CPUThrottlingHigh`.

@@ -86,7 +85,7 @@ Use Grafana to investigate and analyze CPU throttling and memory usage.
```yaml
datasource: default
namespace: sysmgmt-health
pod: prometheus-cray-sysmgmt-health-kube-p-prometheus-0
pod: vmstorage-vms-0
```

### CPU throttling
@@ -122,12 +121,12 @@ Use Grafana to investigate and analyze CPU throttling and memory usage.
## Common customization scenarios

* [Prerequisites](#prerequisites)
* [Prometheus pod is `OOMKilled` or CPU throttled](#prometheus-pod-is-oomkilled-or-cpu-throttled)
* [vmstorage pod is `OOMKilled` or CPU throttled](#vmstorage-pod-is-oomkilled-or-cpu-throttled)
* [Postgres pods are `OOMKilled` or CPU throttled](#postgres-pods-are-oomkilled-or-cpu-throttled)
* [Scale `cray-bss` service](#scale-cray-bss-service)
* [Scale `cray-dns-unbound` service](#scale-cray-dns-unbound-service)
* [Postgres PVC resize](#postgres-pvc-resize)
* [Prometheus PVC resize](#prometheus-pvc-resize)
* [vmstorage PVC resize](#vmstorage-pvc-resize)
* [`cray-hms-hmcollector` pods are `OOMKilled`](#cray-hms-hmcollector-pods-are-oomkilled)
* [`cray-cfs-api` pods are `OOMKilled`](#cray-cfs-api-pods-are-oomkilled)
* [References](#references)
@@ -139,7 +138,7 @@ procedure for a specific chart. In these cases, the section on this page provide
information necessary in order to carry out that procedure. It is recommended to keep both pages open
in different browser windows for easy reference.

### Prometheus pod is `OOMKilled` or CPU throttled
### Vmstorage pod is `OOMKilled` or CPU throttled

Update resources associated with vmstorage in the `sysmgmt-health` namespace.
This example is based on what was needed for a system with 4000 compute nodes.
@@ -153,41 +152,41 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo

**Only follow these steps as part of the previously linked chart redeploy procedure.**

1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources`.
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources`.

* If the number of NCNs is less than 20, then:

```bash
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '2'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '15Gi'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '6'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '30Gi'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.cpu' --style=double '4'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.memory' '8Gi'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.cpu' --style=double '8'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.memory' '16Gi'
```

* If the number of NCNs is 20 or more, then:

```bash
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.cpu' --style=double '6'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory' '50Gi'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.cpu' --style=double '12'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources.limits.memory' '60Gi'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.cpu' --style=double '6'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.requests.memory' '16Gi'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.cpu' --style=double '12'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources.limits.memory' '32Gi'
```

1. Check that the customization file has been updated.

```bash
yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.resources'
yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources'
```

Example output:

```yaml
requests:
cpu: "3"
memory: 15Gi
cpu: "4"
memory: 8Gi
limits:
cpu: "6"
memory: 30Gi
cpu: "8"
memory: 16Gi
```

* (`ncn-mw#`) When reaching the step to validate the redeployed chart, perform the following steps:
@@ -196,23 +195,24 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo

1. Verify that the pod restarts and that the desired resources have been applied.

Watch the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod restart.
Watch the `vmstorage-vms` pods restart.

```bash
watch "kubectl get pods -n sysmgmt-health -l prometheus=cray-sysmgmt-health-kube-p-prometheus"
watch "kubectl get pods -n sysmgmt-health -l app.kubernetes.io/name=vmstorage"
```

It may take about 10 minutes for the `prometheus-cray-sysmgmt-health-kube-p-prometheus-0` pod to terminate.
It may take about 10 minutes for the `vmstorage-vms-*` pods to terminate.
A pod can be force deleted if it remains in the terminating state:

```bash
kubectl delete pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 --force --grace-period=0 -n sysmgmt-health
kubectl delete pod vmstorage-vms-0 --force --grace-period=0 -n sysmgmt-health
kubectl delete pod vmstorage-vms-1 --force --grace-period=0 -n sysmgmt-health
```

1. Verify that the resource changes are in place.

```bash
kubectl get pod prometheus-cray-sysmgmt-health-kube-p-prometheus-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "prometheus").resources'
kubectl get pod vmstorage-vms-0 -n sysmgmt-health -o json | jq -r '.spec.containers[] | select(.name == "vmstorage").resources'
```

* **Make sure to perform the entire linked procedure, including the step to save the updated customizations.**
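
The sizing rule in the steps above can be sketched as a small helper. This is only an illustration: the function name `vmstorage_resources` is hypothetical, and it prints the `yq write` commands for review instead of running them. The thresholds and values are the ones listed in the procedure above.

```bash
# Hypothetical helper (not part of the documented procedure): print the
# yq commands for the vmstorage resource settings that match the NCN count.
# Nothing is executed or written to disk.
vmstorage_resources() {
    local ncn_count="$1"
    local base='spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.resources'
    local req_cpu req_mem lim_cpu lim_mem

    if [ "$ncn_count" -lt 20 ]; then
        req_cpu=4; req_mem=8Gi;  lim_cpu=8;  lim_mem=16Gi
    else
        req_cpu=6; req_mem=16Gi; lim_cpu=12; lim_mem=32Gi
    fi

    # CPU values are written as quoted strings, hence --style=double.
    echo "yq write -i customizations.yaml '${base}.requests.cpu' --style=double '${req_cpu}'"
    echo "yq write -i customizations.yaml '${base}.requests.memory' '${req_mem}'"
    echo "yq write -i customizations.yaml '${base}.limits.cpu' --style=double '${lim_cpu}'"
    echo "yq write -i customizations.yaml '${base}.limits.memory' '${lim_mem}'"
}
```

For example, `vmstorage_resources 12` prints the four commands for a system with fewer than 20 NCNs, which can then be reviewed and run as part of the chart redeploy procedure.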
@@ -525,9 +525,9 @@ Using the values from the above table, follow the [Redeploying a Chart](Redeploy

* **Make sure to perform the entire linked procedure, including the step to save the updated customizations.**

### Prometheus PVC resize
### vmstorage PVC resize

Increase the PVC volume size associated with `prometheus-cray-sysmgmt-health-kube-p-prometheus` cluster in the `sysmgmt-health` namespace.
Increase the PVC volume size associated with the `vmstorage` cluster in the `sysmgmt-health` namespace.
This example is based on what was needed for a system with more than 20 non-compute nodes (NCNs). The PVC size can only ever be increased.
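
Because the size can only grow, it may help to compare the requested size against the current one before editing the customizations. The sketch below is an assumption-laden illustration: the function name `pvc_size_increases` is hypothetical, and it only handles whole-number `Gi` values such as `200Gi` and `300Gi`.

```bash
# Hypothetical helper: succeed only if the requested PVC size is strictly
# larger than the current one. Handles whole-number Gi values only.
pvc_size_increases() {
    local current="${1%Gi}"
    local requested="${2%Gi}"
    [ "$requested" -gt "$current" ]
}
```

For example, `pvc_size_increases 200Gi 300Gi && echo "ok to resize"` prints the message, while swapping the two arguments does not.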

Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the following specifications**:
@@ -538,16 +538,16 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo

**Only follow these steps as part of the previously linked chart redeploy procedure.**

1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage`.
1. Edit the customizations by adding or updating `spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage`.

```bash
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
yq write -i customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage' '300Gi'
```

1. Check that the customization file has been updated.

```bash
yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage'
yq read customizations.yaml 'spec.kubernetes.services.cray-sysmgmt-health.victoria-metrics-k8s-stack.vmcluster.spec.vmstorage.storage.volumeClaimTemplate.spec.resources.requests.storage'
```

Example output:
@@ -563,14 +563,14 @@ Follow the [Redeploying a Chart](Redeploying_a_Chart.md) procedure **with the fo
Verify that the increased volume size has been applied.

```bash
watch "kubectl get pvc -n sysmgmt-health prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0"
watch "kubectl get pvc -n sysmgmt-health vmstorage-db-vmstorage-vms-0"
```

Example output:

```text
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus-cray-sysmgmt-health-kube-p-prometheus-db-prometheus-cray-sysmgmt-health-kube-p-prometheus-0 Bound pvc-bcb8f4f1-fb84-4b48-95c7-63508ef18962 200Gi RWO k8s-block-replicated 3d2h
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
vmstorage-db-vmstorage-vms-0 Bound pvc-092805e3-ac92-438e-b77b-a0639096f5f0 200Gi RWO k8s-block-replicated 3d2h
```

At this point the vmstorage cluster is healthy, but additional steps are required to complete the resize of the vmstorage PVCs.
1 change: 0 additions & 1 deletion operations/README.md
@@ -454,7 +454,6 @@ confident that a lack of issues indicates the system is operating normally.
- [System Management Health](system_management_health/System_Management_Health.md)
- [System Management Health Checks and Alerts](system_management_health/System_Management_Health_Checks_and_Alerts.md)
- [Access System Management Health Services](system_management_health/Access_System_Management_Health_Services.md)
- [Configure Prometheus Alerta Alert Notifications](system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md)
- [Configure Prometheus Email Alert Notifications](system_management_health/Configure_Prometheus_Email_Alert_Notifications.md)
- [Grafana Dashboards by Component](system_management_health/Grafana_Dashboards_by_Component.md)
- [Troubleshoot Grafana Dashboard](system_management_health/Troubleshoot_Grafana_Dashboard.md)
@@ -80,17 +80,17 @@ Identify pods that are hitting resource limits in order to increase the resource
Example output:

```text
default 54m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1223856 (prometheus) score 1966 or sacrifice child
default 44m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1372634 (prometheus) score 1966 or sacrifice child
default 54m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1223856 (vmstorage) score 1966 or sacrifice child
default 44m Warning OOMKilling node/ncn-w003 Memory cgroup out of memory: Kill process 1372634 (vmstorage) score 1966 or sacrifice child
```

1. (`ncn-mw#`) Determine which pod was killed using the output of the previous command.

Search the pods in Kubernetes for the string returned in the previous step to find the exact pod name.
Based on the previous example command output, `prometheus` is used in this example:
Based on the previous example command output, `vmstorage` is used in this example:

```bash
kubectl get pod -A | grep prometheus
kubectl get pod -A | grep vmstorage
```

1. Increase the resource limits for the pods identified in this procedure.
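
Picking the process name out of the `OOMKilling` events can be scripted. The following is a minimal sketch, assuming event lines in the format of the example output above. The function name `oom_process_names` is hypothetical.

```bash
# Hypothetical helper: read OOMKilling event lines on stdin and print the
# unique process names found in "Kill process <pid> (<name>)".
oom_process_names() {
    sed -n 's/.*Kill process [0-9][0-9]* (\([^)]*\)).*/\1/p' | sort -u
}
```

Piping the event output from the earlier step into `oom_process_names` gives the strings to search for with `kubectl get pod -A | grep`.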
@@ -8,7 +8,7 @@ The Customer Management Network \(CMN\) provides access from outside the custome

- Administrator clients outside of the system:
- Log in to NCNs.
- Access administrative web UIs within the system \(e.g. Prometheus, Grafana, and more\).
- Access administrative web UIs within the system \(e.g. Vmselect, Grafana, and more\).
- Access the administrative REST APIs.
- Access a DNS server within the system for resolution of names for the webUI and REST API services.
- Run administrative Cray CLI commands from outside the system.
@@ -24,7 +24,7 @@ See [External DNS](../external_dns/External_DNS.md) for more information.
| OAuth2 Proxy Ingress - CMN |   | customer-management | Yes | 443 |   |
| OAuth2 Proxy Ingress - CAN |   | customer-access | Yes | 443 |   |
| OAuth2 Proxy Ingress - CHN |   | customer-high-speed | Yes | 443 |   |
| System Management Health Prometheus | `prometheus` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
| System Management Health Vmselect | `vmselect` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
| System Management Health Alert Manager | `alertmanager` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
| System Management Health Grafana | `grafana` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
| Istio Kiali | `kiali-istio` | |   | No | Uses the IP address of OAuth2 Proxy Ingress (CMN) |
4 changes: 2 additions & 2 deletions operations/network/dns/PowerDNS_Configuration.md
@@ -31,13 +31,13 @@ cray-dns-powerdns-nmn-udp LoadBalancer 10.17.203.241 10.92.100.85 53:318

A system administrator would typically set up the subdomain `system.dev.cray.com` in their site DNS and create a record which points to the IP address `10.101.8.113`, for example `ins1.system.dev.cray.com`.

The administrator would then delegate queries to `system.dev.cray.com` to `ins1.system.dev.cray.com` making it authoritative for that subdomain allowing CSM to respond to queries for services like `prometheus.system.dev.cray.com`
The administrator would then delegate queries to `system.dev.cray.com` to `ins1.system.dev.cray.com` making it authoritative for that subdomain allowing CSM to respond to queries for services like `grafana.system.dev.cray.com`

The specifics of configuring DNS forwarding depend on the DNS server in use; consult the documentation provided by the DNS server vendor for more information.

## Authoritative Zone Transfer

In addition to responding to external DNS queries, PowerDNS can support replication of domain information to secondary servers via AXFR (Authoritative Zone Transfer) queries.
In addition to responding to external DNS queries, PowerDNS can support replication of domain information to secondary servers via AXFR (Authoritative Zone Transfer) queries.

### Configuration parameters

2 changes: 1 addition & 1 deletion operations/network/dns/PowerDNS_migration.md
@@ -59,7 +59,7 @@ The following table of examples assumes that the system was configured with a `s
| auth.shasta.dev.cray.com | auth.cmn.shasta.dev.cray.com | | No |
| nexus.shasta.dev.cray.com | nexus.cmn.shasta.dev.cray.com | | No |
| grafana.shasta.dev.cray.com | grafana.cmn.shasta.dev.cray.com | | No |
| prometheus.shasta.dev.cray.com | prometheus.cmn.shasta.dev.cray.com | | No |
| vmselect.shasta.dev.cray.com | vmselect.cmn.shasta.dev.cray.com | | No |
| alertmanager.shasta.dev.cray.com | alertmanager.cmn.shasta.dev.cray.com | | No |
| vcs.shasta.dev.cray.com | vcs.cmn.shasta.dev.cray.com | | No |
| kiali-istio.shasta.dev.cray.com | kiali-istio.cmn.shasta.dev.cray.com | | No |