
Add AI model demo Helm chart and Rancher prime installation script #29

Open · wants to merge 13 commits into base: develop
16 changes: 16 additions & 0 deletions assets/fleet/clustergroup.yaml
@@ -0,0 +1,16 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: build-a-dino
  annotations:
    {}
    # key: string
  labels:
    {}
    # key: string
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      gpu-enabled: 'true'
      app: build-a-dino
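
The group only matches clusters whose Fleet `Cluster` objects carry both of these labels. A minimal sketch of a matching `Cluster` resource (the cluster name `demo-cluster` is a hypothetical example; in Rancher these labels are usually applied via the cluster UI or `kubectl label`):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  name: demo-cluster        # hypothetical cluster name
  namespace: fleet-default
  labels:
    gpu-enabled: 'true'     # both labels must match the ClusterGroup selector
    app: build-a-dino
```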
24 changes: 24 additions & 0 deletions assets/fleet/gitrepo.yaml
@@ -0,0 +1,24 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: build-a-dino
  annotations:
    {}
    # key: string
  labels:
    {}
    # key: string
  namespace: fleet-default
spec:
  branch: main
  correctDrift:
    enabled: true
    # force: boolean
    # keepFailHistory: boolean
  insecureSkipTLSVerify: false
  paths:
    - /fleet/build-a-dino
    # - string
  repo: https://github.com/wiredquill/prime-rodeo
  targets:
    - clusterGroup: build-a-dino
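
Fleet scans the `/fleet/build-a-dino` path of the referenced repo for bundle definitions; such a directory typically contains a `fleet.yaml` describing what to deploy. A hedged sketch of what that file might look like (the contents are illustrative assumptions, not taken from the repo):

```yaml
# Hypothetical fleet.yaml under /fleet/build-a-dino in the referenced repo
defaultNamespace: build-a-dino
helm:
  releaseName: build-a-dino   # illustrative; actual chart/values live in the repo
```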
85 changes: 85 additions & 0 deletions assets/monitors/cpu-throttling.yaml
@@ -0,0 +1,85 @@
nodes:
- _type: Monitor
arguments:
comparator: GT
failureState: DEVIATING
metric:
aliasTemplate: CPU Throttling for ${container} of ${pod_name}
query: 100 * sum by (cluster_name, namespace, pod_name, container) (container_cpu_throttled_periods{})
/ sum by (cluster_name, namespace, pod_name, container) (container_cpu_elapsed_periods{})
unit: percent
threshold: 95.0
urnTemplate: urn:kubernetes:/${cluster_name}:${namespace}:pod/${pod_name}
description: |-
In Kubernetes, CPU throttling refers to the process where limits are applied to the amount of CPU resources a container can use.
This typically occurs when a container approaches the maximum CPU resources allocated to it, causing the system to throttle or restrict
its CPU usage to prevent a crash.

While CPU throttling can help maintain system stability by avoiding crashes due to CPU exhaustion, it can also significantly slow down workload
performance. Ideally, CPU throttling should be avoided by ensuring that containers have access to sufficient CPU resources.
This proactive approach helps maintain optimal performance and prevents the slowdown associated with throttling.
function: {{ get "urn:stackpack:common:monitor-function:threshold" }}
id: -13
identifier: urn:custom:monitor:pod-cpu-throttling-v2
intervalSeconds: 60
name: CPU Throttling V2
remediationHint: |-

### Application behaviour

Check the container [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) for any hints on how the application behaves under CPU throttling.

### Understanding CPU Usage and CPU Throttling

On the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) you will find the CPU Usage and CPU Throttling charts.

#### CPU Throttling

The percentage of CPU throttling over time. CPU throttling occurs when a container reaches its CPU limit, restricting its CPU usage to
prevent it from exceeding the specified limit. The higher the percentage, the more throttling is occurring, which means the container's
performance is being constrained.

#### CPU Usage

This chart shows three key CPU metrics over time:

1. Request: The amount of CPU the container requests as its minimum requirement. This sets the baseline CPU resources the container is guaranteed to receive.
2. Limit: The maximum amount of CPU the container can use. If the container's usage reaches this limit, throttling will occur.
3. Current: The actual CPU usage of the container in real-time.

The `Request` and `Limit` settings for the container can be seen in the `Resources` section of the [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration).

#### Correlation

The two charts are correlated in the following way:

- As the `Current` CPU usage approaches the CPU `Limit`, the CPU throttling percentage increases. This is because the container tries to use more CPU than it is allowed, and the system restricts it, causing throttling.
- The aim is to keep the `Current` usage below the `Limit` to minimize throttling. If you see frequent high percentages in the CPU throttling chart, it suggests that you may need to adjust the CPU limits or optimize the container's workload to reduce CPU demand.


### Adjust CPU Requests and Limits

Check the [pod highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) to see whether a `Deployment` event happened recently after which the CPU usage behaviour changed.

You can investigate which change led to the CPU throttling by checking [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange),
which highlights the latest changeset for the deployment. You can then revert the change or fix the CPU request and limit.


Review the pod's resource requests and limits to ensure they are set appropriately.
Show component [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration)

If the CPU usage consistently hits the limit, consider increasing the CPU limit of the pod. <br/>
Edit the pod or deployment configuration file to modify the `resources.limits.cpu` and `resources.requests.cpu` as needed.
```yaml
resources:
  requests:
    cpu: "500m"  # Adjust this value based on analysis
  limits:
    cpu: "1"     # Adjust this value based on analysis
```
If CPU throttling persists, consider horizontal pod autoscaling to distribute the workload across more pods, or adjust the cluster's node resources to meet the demands. Continuously monitor and fine-tune resource settings to optimize performance and prevent further throttling issues.
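
The horizontal autoscaling option mentioned above can be sketched with a standard `autoscaling/v2` HorizontalPodAutoscaler; the names and numbers below are illustrative assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # illustrative target deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before containers approach their limit
```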
status: ENABLED
tags:
- cpu
- performance
- pod
236 changes: 236 additions & 0 deletions assets/monitors/pods-in-waiting-state.yaml
@@ -0,0 +1,236 @@
nodes:
- _type: Monitor
arguments:
failureState: CRITICAL
loggingLevel: WARN
description: |
If a pod is in a waiting state with a reason of CreateContainerConfigError, CreateContainerError,
CrashLoopBackOff, or ImagePullBackOff, it will be seen as deviating.
function: {{ get "urn:stackpack:kubernetes-v2:shared:monitor-function:pods-in-waiting-state" }}
id: -6
identifier: urn:custom:monitor:pods-in-waiting-state-v2
intervalSeconds: 30
name: Pods in Waiting State V2
remediationHint: |-
\{{#if reasons\}}
\{{#if reasons.CreateContainerConfigError\}}
## CreateContainerConfigError

In the case of CreateContainerConfigError, common causes are a Secret or ConfigMap that is referenced in [your pod](/#/components/\{{ componentUrnForUrl \}}) but doesn’t exist.

### Missing ConfigMap

In the case of a missing ConfigMap you will see an error like `Error: configmap "mydb-config" not found` mentioned in the message of this monitor.

To solve this you should reference an existing ConfigMap.

An example:

```bash
# See if the configmap exists
kubectl get configmap mydb-config

# Create the correct configmap; this is just an example
kubectl create configmap mydb-config --from-literal=database_name=mydb

# Delete and recreate the pod using this configmap
kubectl delete -f mydb_pod.yaml
kubectl create -f mydb_pod.yaml

# After recreating the pod, it should be in a running state.
# This is visible because the waiting pod monitor will no longer trigger on this condition.
```

### Missing Secret

In the case of a missing Secret you will see an error like `Error from server (NotFound): secrets "my-secret" not found`
mentioned in the message of this monitor.

To solve this you should reference an existing Secret.

An example:

```bash
# See if the secret exists
kubectl get secret mydb-secret

# Create the correct secret; this is just an example
kubectl create secret generic mydb-secret --from-literal=password=mysupersecretpassword

# Delete and recreate the pod using this secret
kubectl delete -f mydb_pod.yaml
kubectl create -f mydb_pod.yaml

# After recreating the pod, it should be in a running state.
# This is visible because the waiting pod monitor will no longer trigger on this condition.
```
\{{/if\}}
\{{#if reasons.CreateContainerError\}}
## CreateContainerError

Common causes for a CreateContainerError are:

- Command Not Available
- Issues Mounting a Volume
- Container Runtime Not Cleaning Up Old Containers

### Command Not Available

In the case of `Command Not Available` you will find this in the reason field at the top of this monitor (full screen).
If this is the case, the first thing to investigate is whether you have a valid ENTRYPOINT in the Dockerfile
used to build your container image.

If you don’t have access to the Dockerfile, you can configure your pod object by using
a valid command in the command attribute of the object.

Check if your pod has a command set by inspecting the [Configuration](/#/components/\{{ componentUrnForUrl \}}#configuration) of the pod, e.g.:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nodeapp
  labels:
    app: nodeapp
spec:
  containers:
    - image: myimage/wrong-node-app
      name: nodeapp
      ports:
        - containerPort: 80
      command: ["node", "index.js"]
```

If the pod does not have a command set, check the container definition to see if an ENTRYPOINT is set; below is an example Dockerfile without a valid ENTRYPOINT.

If no existing ENTRYPOINT is set and the pod does not have a command, the solution is to use a valid command in the pod definition:

```dockerfile
FROM node:16.3.0-alpine
WORKDIR /usr/src/app
COPY package*.json ./

RUN npm install
COPY . .

EXPOSE 8080

ENTRYPOINT []
```

### Issues Mounting a Volume

In the case of a `volume mount problem` the message of this monitor will give you a hint. For example, if you have a message like:

```
Error: Error response from daemon: create \mnt\data: "\\mnt\\data" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path
```

In this case you should change the path in the PersistentVolume definition to a valid path, e.g. `/mnt/data`.
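
A minimal sketch of a PersistentVolume using a valid absolute `hostPath` (the name and capacity are illustrative assumptions):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mydb-pv             # illustrative name
spec:
  capacity:
    storage: 1Gi            # illustrative size
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data         # valid absolute path instead of \mnt\data
```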

### Container Runtime Not Cleaning Up Old Containers

In this case you will see a message like:

```
The container name "/myapp_ed236ae738" is already in use by container "22f4edaec41cb193857aefcead3b86cdb69edfd69b2ab57486dff63102b24d29". You have to remove (or rename) that container to be able to reuse that name.
```

This is an indication that the [container runtime](https://kubernetes.io/docs/setup/production-environment/container-runtimes/)
doesn’t clean up old containers.
In this case the node should be removed from the cluster and the node container runtime should be reinstalled
(or be recreated). After that the node should be (re)assigned to the cluster.

\{{/if\}}
\{{#if reasons.CrashLoopBackOff\}}
## CrashLoopBackOff

When a Kubernetes container has errors, it can enter into a state called CrashLoopBackOff, where Kubernetes attempts to restart the container to resolve the issue.

The container will continue to restart until the problem is resolved.

Take the following steps to diagnose the problem:

### Container Logs
Check the container logs for any explicit errors or warnings

1. Inspect the [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) of all the containers in this pod.
2. Scroll through them and check whether there is an excessive amount of errors.
   1. If a container is crashing due to an out-of-memory error, the logs may show errors related to memory allocation or exhaustion.
      - If this is the case, check if the memory limits are too low, in which case you can raise them.
      - If the memory problem is not resolved, you might have introduced a memory leak, in which case you want to take a look at the last deployment.
      - If there are no limits, you might have a problem with the physical memory on the node running the pod.
   2. If a container is crashing due to a configuration error, the logs may show errors related to the incorrect configuration.
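
If the limits turn out to be too low, raising them is a small manifest change; a hedged sketch of the container's `resources` section (the values are illustrative and should come from your own analysis):

```yaml
resources:
  requests:
    memory: "256Mi"   # illustrative baseline
  limits:
    memory: "512Mi"   # raise if the container is OOM-killed at the old limit
```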

### Understand application

It is important to understand what the intended behaviour of the application should be.
A good place to start is the [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration).
Pay attention to environment variables and volume mounts, as these are mechanisms to configure the application.
You can use references to ConfigMaps and Secrets to further explore configuration information.

### Pod Events
Check the pod events to identify any explicit errors or warnings.
1. Go to the [Pod events page](/#/components/\{{ componentUrnForUrl \}}/events).
2. Check if there is a large number of events like `BackOff`, `FailedScheduling` or `FailedAttachVolume`.
3. If this is the case, see if the event details (click on the event) contains more information about this issue.

### Recent Deployment
Look at the pod age in the "About" section on the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}}) to identify any recent deployments that might have caused the issue.

1. The "Age" is shown in the "About" section on the left side of the screen
2. If the "Age" and the time that the monitor was triggered are in close proximity then take a look at the most recent deployment by clicking on [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange).
\{{/if\}}
\{{#if reasons.ImagePullBackOff\}}
## ImagePullBackOff

If you see the "ImagePullBackOff" error message while trying to pull a container image from a registry, it means that
the Docker engine was unable to pull the requested image for some reason.

The reason field at the top of this monitor (full screen) might give you more information about the specific issue at hand.

## Diagnose

To diagnose the problem, try the following actions:

- Go to the [pod events page filtered by failed or unhealthy events](/#/components/\{{ componentUrnForUrl \}}/events?view=eventTypes--Unhealthy,Created,FailedMount,Failed)

If there are no "Failed" events shown, increase the time range by clicking on the Zoom-out button next to the telemetry-time-interval at the bottom left of the timeline.

On the left side of the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}}), click on "Containers" in the "Related resources" section
to view the `containers` and the `Image URL`.

## Common causes

### Rate Limit
A Docker Hub rate limit has been reached.

Typical resolution is to authenticate using Docker Hub credentials (this increases the rate limit from 100 to 200 pulls per 6 hours)
or to get a paid account and authenticate with that (bumping the limit to 5000 pulls per day).

### Network connectivity issues
Check your internet connection or the connection to the registry where the image is hosted.

### Authentication problems
If the registry requires authentication, make sure that your credentials are correct and that
you have the necessary permissions to access the image.
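
Registry credentials are typically wired into the pod through an image pull secret; a hedged sketch (all names, the registry, and the image are illustrative assumptions):

```yaml
# Create the secret first, e.g.:
#   kubectl create secret docker-registry regcred \
#     --docker-server=<registry> --docker-username=<user> --docker-password=<password>
apiVersion: v1
kind: Pod
metadata:
  name: myapp               # illustrative name
spec:
  imagePullSecrets:
    - name: regcred         # must match the secret created above
  containers:
    - name: myapp
      image: registry.example.com/myapp:1.0   # illustrative private image
```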

### Image availability
Verify that the image you are trying to pull exists in the registry and that you have specified the correct image name and tag.

Here are some steps you can take to resolve the "ImagePullBackOff" error:

1. Check the registry logs for any error messages that might provide more information about the issue.
2. Verify that the image exists in the registry and that you have the correct image name and tag.
3. Check your network connectivity to ensure that you can reach the registry.
4. Check the authentication credentials to ensure that they are correct and have the necessary permissions.

If none of these steps work, you may need to consult the Docker documentation or contact support for the registry or Docker
itself for further assistance.
\{{/if\}}
\{{/if\}}
status: ENABLED
tags:
- pods
- containers
timestamp: 2024-10-17T10:15:31.714348Z[Etc/UTC]
12 changes: 12 additions & 0 deletions charts/ai-model/Chart.yaml
@@ -0,0 +1,12 @@
apiVersion: v2
name: ai-model
description: A Helm chart for ai-model Mackroservices
type: application
version: 0.1.0
appVersion: "0.1.0"
maintainers:
  - name: hierynomus
    email: [email protected]
keywords:
  - challenge
  - observability