
Add AI model demo Helm chart and Rancher prime installation script #29

Open · wants to merge 13 commits into base: develop
16 changes: 16 additions & 0 deletions assets/fleet/clustergroup.yaml
@@ -0,0 +1,16 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: build-a-dino
  annotations:
    {}
    # key: string
  labels:
    {}
    # key: string
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      gpu-enabled: 'true'
      app: build-a-dino
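
The group only matches clusters whose Fleet `Cluster` objects carry both of these labels. A minimal sketch of a matching `Cluster` resource (the cluster name `demo-cluster` is a hypothetical example; in Rancher these labels are usually applied via the cluster UI or `kubectl label`):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  name: demo-cluster        # hypothetical cluster name
  namespace: fleet-default
  labels:
    gpu-enabled: 'true'     # both labels must match the ClusterGroup selector
    app: build-a-dino
```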
24 changes: 24 additions & 0 deletions assets/fleet/gitrepo.yaml
@@ -0,0 +1,24 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: build-a-dino
  annotations:
    {}
    # key: string
  labels:
    {}
    # key: string
  namespace: fleet-default
spec:
  branch: main
  correctDrift:
    enabled: true
    # force: boolean
    # keepFailHistory: boolean
  insecureSkipTLSVerify: false
  paths:
    - /fleet/build-a-dino
    # - string
  repo: https://github.com/wiredquill/prime-rodeo
  targets:
    - clusterGroup: build-a-dino
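
Fleet scans the `/fleet/build-a-dino` path of the referenced repo for bundle definitions; such a directory typically contains a `fleet.yaml` describing what to deploy. A hedged sketch of what that file might look like (the contents are illustrative assumptions, not taken from the repo):

```yaml
# Hypothetical fleet.yaml under /fleet/build-a-dino in the referenced repo
defaultNamespace: build-a-dino
helm:
  releaseName: build-a-dino   # illustrative; actual chart/values live in the repo
```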
85 changes: 85 additions & 0 deletions assets/monitors/cpu-throttling.yaml
@@ -0,0 +1,85 @@
nodes:
- _type: Monitor
arguments:
comparator: GT
failureState: DEVIATING
metric:
aliasTemplate: CPU Throttling for ${container} of ${pod_name}
query: 100 * sum by (cluster_name, namespace, pod_name, container) (container_cpu_throttled_periods{})
/ sum by (cluster_name, namespace, pod_name, container) (container_cpu_elapsed_periods{})
unit: percent
threshold: 95.0
urnTemplate: urn:kubernetes:/${cluster_name}:${namespace}:pod/${pod_name}
description: |-
In Kubernetes, CPU throttling refers to the process where limits are applied to the amount of CPU resources a container can use.
This typically occurs when a container approaches the maximum CPU resources allocated to it, causing the system to throttle or restrict
its CPU usage to prevent a crash.

While CPU throttling can help maintain system stability by avoiding crashes due to CPU exhaustion, it can also significantly slow down workload
performance. Ideally, CPU throttling should be avoided by ensuring that containers have access to sufficient CPU resources.
This proactive approach helps maintain optimal performance and prevents the slowdown associated with throttling.
function: {{ get "urn:stackpack:common:monitor-function:threshold" }}
id: -13
identifier: urn:custom:monitor:pod-cpu-throttling-v2
intervalSeconds: 60
name: CPU Throttling V2
remediationHint: |-

### Application behaviour

Check the container [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) for any hints on how the application behaves under CPU throttling.

### Understanding CPU Usage and CPU Throttling

On the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) you will find the CPU Usage and CPU Throttling charts.

#### CPU Throttling

The percentage of CPU throttling over time. CPU throttling occurs when a container reaches its CPU limit, restricting its CPU usage to
prevent it from exceeding the specified limit. The higher the percentage, the more throttling is occurring, which means the container's
performance is being constrained.

#### CPU Usage

This chart shows three key CPU metrics over time:

1. Request: The amount of CPU the container requests as its minimum requirement. This sets the baseline CPU resources the container is guaranteed to receive.
2. Limit: The maximum amount of CPU the container can use. If the container's usage reaches this limit, throttling will occur.
3. Current: The actual CPU usage of the container in real-time.

The `Request` and `Limit` settings for the container can be seen in the `Resources` section of the [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration).

#### Correlation

The two charts are correlated in the following way:

- As the `Current` CPU usage approaches the CPU `Limit`, the CPU throttling percentage increases. This is because the container tries to use more CPU than it is allowed, and the system restricts it, causing throttling.
- The aim is to keep the `Current` usage below the `Limit` to minimize throttling. If you see frequent high percentages in the CPU throttling chart, it suggests that you may need to adjust the CPU limits or optimize the container's workload to reduce CPU demand.


### Adjust CPU Requests and Limits

Check the [pod highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) to see whether a `Deployment` event happened recently after which the CPU usage behaviour changed.

You can investigate which change led to the CPU throttling by checking [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange),
which highlights the latest changeset for the deployment. You can then revert the change or fix the CPU request and limit.


Review the pod's resource requests and limits to ensure they are set appropriately.
Show component [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration)

If the CPU usage consistently hits the limit, consider increasing the CPU limit of the pod. <br/>
Edit the pod or deployment configuration file to modify the `resources.limits.cpu` and `resources.requests.cpu` as needed.
```yaml
resources:
  requests:
    cpu: "500m"  # Adjust this value based on analysis
  limits:
    cpu: "1"     # Adjust this value based on analysis
```
If CPU throttling persists, consider horizontal pod autoscaling to distribute the workload across more pods, or adjust the cluster's node resources to meet the demands. Continuously monitor and fine-tune resource settings to optimize performance and prevent further throttling issues.
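
The horizontal autoscaling option mentioned above can be sketched with a standard `autoscaling/v2` HorizontalPodAutoscaler; the names and numbers below are illustrative assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # illustrative target deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before containers approach their limit
```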
status: ENABLED
tags:
- cpu
- performance
- pod
236 changes: 236 additions & 0 deletions assets/monitors/pods-in-waiting-state.yaml
@@ -0,0 +1,236 @@
nodes:
- _type: Monitor
arguments:
failureState: CRITICAL
loggingLevel: WARN
description: |
If a pod is in a waiting state with a reason of CreateContainerConfigError, CreateContainerError,
CrashLoopBackOff, or ImagePullBackOff, it will be seen as deviating.
function: {{ get "urn:stackpack:kubernetes-v2:shared:monitor-function:pods-in-waiting-state" }}
id: -6
identifier: urn:custom:monitor:pods-in-waiting-state-v2
intervalSeconds: 30
name: Pods in Waiting State V2
remediationHint: |-
\{{#if reasons\}}
\{{#if reasons.CreateContainerConfigError\}}
## CreateContainerConfigError

In the case of CreateContainerConfigError, common causes are a Secret or ConfigMap that is referenced in [your pod](/#/components/\{{ componentUrnForUrl \}}) but doesn’t exist.

### Missing ConfigMap

In the case of a missing ConfigMap you will see an error like `Error: configmap "mydb-config" not found` mentioned in the message of this monitor.

To solve this you should reference an existing ConfigMap.

An example:

```bash
# See if the configmap exists
kubectl get configmap mydb-config

# Create the correct configmap; this is just an example
kubectl create configmap mydb-config --from-literal=database_name=mydb

# Delete and recreate the pod using this configmap
kubectl delete -f mydb_pod.yaml
kubectl create -f mydb_pod.yaml

# After recreating the pod, it should be in a running state.
# This is visible because the waiting pod monitor will no longer trigger on this condition.
```

### Missing Secret

In the case of a missing Secret you will see an error like `Error from server (NotFound): secrets "my-secret" not found`
mentioned in the message of this monitor.

To solve this you should reference an existing Secret.

An example:

```bash
# See if the secret exists
kubectl get secret mydb-secret

# Create the correct secret; this is just an example
kubectl create secret generic mydb-secret --from-literal=password=mysupersecretpassword

# Delete and recreate the pod using this secret
kubectl delete -f mydb_pod.yaml
kubectl create -f mydb_pod.yaml

# After recreating the pod, it should be in a running state.
# This is visible because the waiting pod monitor will no longer trigger on this condition.
```
\{{/if\}}
\{{#if reasons.CreateContainerError\}}
## CreateContainerError

Common causes for a CreateContainerError are:

- Command Not Available
- Issues Mounting a Volume
- Container Runtime Not Cleaning Up Old Containers

### Command Not Available

In the case of `Command Not Available` you will find this in the reason field at the top of this monitor (full screen).
If this is the case, the first thing to investigate is whether you have a valid ENTRYPOINT in the Dockerfile
used to build your container image.

If you don’t have access to the Dockerfile, you can configure your pod object by using
a valid command in the command attribute of the object.

Check if your pod has a command set by inspecting the [Configuration](/#/components/\{{ componentUrnForUrl \}}#configuration) of the pod, e.g.:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nodeapp
  labels:
    app: nodeapp
spec:
  containers:
    - image: myimage/wrong-node-app
      name: nodeapp
      ports:
        - containerPort: 80
      command: ["node", "index.js"]
```

If the pod does not have a command set, check the container definition to see if an ENTRYPOINT is set; below is an example Dockerfile without a valid ENTRYPOINT.

If no existing ENTRYPOINT is set and the pod does not have a command, the solution is to use a valid command in the pod definition:

```dockerfile
FROM node:16.3.0-alpine
WORKDIR /usr/src/app
COPY package*.json ./

RUN npm install
COPY . .

EXPOSE 8080

ENTRYPOINT []
```

### Issues Mounting a Volume

In the case of a `volume mount problem` the message of this monitor will give you a hint. For example, if you have a message like:

```
Error: Error response from daemon: create \mnt\data: "\\mnt\\data" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path
```

In this case you should change the path in the PersistentVolume definition to a valid path, e.g. `/mnt/data`.
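
A minimal sketch of a PersistentVolume using a valid absolute `hostPath` (the name and capacity are illustrative assumptions):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mydb-pv             # illustrative name
spec:
  capacity:
    storage: 1Gi            # illustrative size
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data         # valid absolute path instead of \mnt\data
```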

### Container Runtime Not Cleaning Up Old Containers

In this case you will see a message like:

```
The container name "/myapp_ed236ae738" is already in use by container "22f4edaec41cb193857aefcead3b86cdb69edfd69b2ab57486dff63102b24d29". You have to remove (or rename) that container to be able to reuse that name.
```

This is an indication that the [container runtime](https://kubernetes.io/docs/setup/production-environment/container-runtimes/)
doesn’t clean up old containers.
In this case the node should be removed from the cluster and the node container runtime should be reinstalled
(or be recreated). After that the node should be (re)assigned to the cluster.

\{{/if\}}
\{{#if reasons.CrashLoopBackOff\}}
## CrashLoopBackOff

When a Kubernetes container has errors, it can enter into a state called CrashLoopBackOff, where Kubernetes attempts to restart the container to resolve the issue.

The container will continue to restart until the problem is resolved.

Take the following steps to diagnose the problem:

### Container Logs
Check the container logs for any explicit errors or warnings

1. Inspect the [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) of all the containers in this pod.
2. Scroll through them and check whether there is an excessive amount of errors.
   1. If a container is crashing due to an out-of-memory error, the logs may show errors related to memory allocation or exhaustion.
      - If this is the case, check if the memory limits are too low, in which case you can raise them.
      - If the memory problem is not resolved, you might have introduced a memory leak, in which case you want to take a look at the last deployment.
      - If there are no limits, you might have a problem with the physical memory on the node running the pod.
   2. If a container is crashing due to a configuration error, the logs may show errors related to the incorrect configuration.
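
If the limits turn out to be too low, raising them is a small manifest change; a hedged sketch of the container's `resources` section (the values are illustrative and should come from your own analysis):

```yaml
resources:
  requests:
    memory: "256Mi"   # illustrative baseline
  limits:
    memory: "512Mi"   # raise if the container is OOM-killed at the old limit
```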

### Understand application

It is important to understand what the intended behaviour of the application should be.
A good place to start is the [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration).
Pay attention to environment variables and volume mounts, as these are mechanisms to configure the application.
You can use references to ConfigMaps and Secrets to further explore configuration information.

### Pod Events
Check the pod events to identify any explicit errors or warnings.
1. Go to the [Pod events page](/#/components/\{{ componentUrnForUrl \}}/events).
2. Check if there is a large number of events like `BackOff`, `FailedScheduling` or `FailedAttachVolume`.
3. If this is the case, see if the event details (click on the event) contains more information about this issue.

### Recent Deployment
Look at the pod age in the "About" section on the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}}) to identify any recent deployments that might have caused the issue.

1. The "Age" is shown in the "About" section on the left side of the screen
2. If the "Age" and the time that the monitor was triggered are in close proximity then take a look at the most recent deployment by clicking on [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange).
\{{/if\}}
\{{#if reasons.ImagePullBackOff\}}
## ImagePullBackOff

If you see the "ImagePullBackOff" error message while trying to pull a container image from a registry, it means that
the Docker engine was unable to pull the requested image for some reason.

The reason field at the top of this monitor (full screen) might give you more information about the specific issue at hand.

## Diagnose

To diagnose the problem, try the following actions:

- Go to the [pod events page filtered by failed or unhealthy events](/#/components/\{{ componentUrnForUrl \}}/events?view=eventTypes--Unhealthy,Created,FailedMount,Failed)

If there are no "Failed" events shown, increase the time range by clicking on the Zoom-out button next to the telemetry-time-interval at the bottom left of the timeline.

On the left side of the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}}), click on "Containers" in the "Related resources" section
to view the `containers` and the `Image URL`.

## Common causes

### Rate Limit
A Docker Hub rate limit has been reached.

Typical resolution is to authenticate using Docker Hub credentials (this increases the rate limit from 100 to 200 pulls per 6 hours)
or to get a paid account and authenticate with that (bumping the limit to 5000 pulls per day).

### Network connectivity issues
Check your internet connection or the connection to the registry where the image is hosted.

### Authentication problems
If the registry requires authentication, make sure that your credentials are correct and that
you have the necessary permissions to access the image.
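
Registry credentials are typically wired into the pod through an image pull secret; a hedged sketch (all names, the registry, and the image are illustrative assumptions):

```yaml
# Create the secret first, e.g.:
#   kubectl create secret docker-registry regcred \
#     --docker-server=<registry> --docker-username=<user> --docker-password=<password>
apiVersion: v1
kind: Pod
metadata:
  name: myapp               # illustrative name
spec:
  imagePullSecrets:
    - name: regcred         # must match the secret created above
  containers:
    - name: myapp
      image: registry.example.com/myapp:1.0   # illustrative private image
```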

### Image availability
Verify that the image you are trying to pull exists in the registry and that you have specified the correct image name and tag.

Here are some steps you can take to resolve the "ImagePullBackOff" error:

1. Check the registry logs for any error messages that might provide more information about the issue.
2. Verify that the image exists in the registry and that you have the correct image name and tag.
3. Check your network connectivity to ensure that you can reach the registry.
4. Check the authentication credentials to ensure that they are correct and have the necessary permissions.

If none of these steps work, you may need to consult the Docker documentation or contact support for the registry or Docker
itself for further assistance.
\{{/if\}}
\{{/if\}}
status: ENABLED
tags:
- pods
- containers
timestamp: 2024-10-17T10:15:31.714348Z[Etc/UTC]
12 changes: 12 additions & 0 deletions charts/ai-model/Chart.yaml
@@ -0,0 +1,12 @@
apiVersion: v2
name: ai-model
description: A Helm chart for ai-model Mackroservices
type: application
version: 0.1.0
appVersion: "0.1.0"
maintainers:
  - name: hierynomus
    email: [email protected]
keywords:
  - challenge
  - observability