Add staff clusters and gpu guide #29825

253 changes: 253 additions & 0 deletions website/src/content/docs/guides/talos/gpu.md
---
title: Adding an NVIDIA GPU to your Cluster
---

:::caution[Charts]

Adding a GPU to your cluster isn't covered by the Support Policy.
Feel free to open a thread in the appropriate channel on our Discord server.

:::

## Prerequisites

- Your GPU is isolated
- The GPU is passed through to your Talos machine

## Extensions

:::caution[Charts]

This guide assumes you are using clustertool for your Talos cluster. The steps may differ otherwise.

:::

It's important to add the following extensions to your `talconfig.yaml` before bootstrapping:

```yaml
schematic:
  customization:
    systemExtensions:
      officialExtensions:
        - siderolabs/nonfree-kmod-nvidia-lts
        - siderolabs/nvidia-container-toolkit-lts
```
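
If your nodes are already running, editing `talconfig.yaml` alone is not enough: the new extensions only become active once the node boots an installer image built from the updated schematic. Clustertool normally takes care of this during an upgrade; as a generic fallback, a hedged `talosctl` sketch (node IP, schematic ID and Talos version are placeholders you need to fill in):

```bash
# Hedged example: upgrade a node to an installer image that includes the NVIDIA extensions.
# <node-ip>, <schematic-id> and <talos-version> are placeholders for your own values.
talosctl upgrade \
  --nodes <node-ip> \
  --image factory.talos.dev/installer/<schematic-id>:<talos-version>
```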

## GPU Talos Patch

Additionally, you will need to create the following patch file `gpu.yaml` in the patches folder of clustertool:

```yaml
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
```

This patch file then needs to be added to the `talconfig.yaml`:

```yaml
patches:
  - "@./patches/gpu.yaml"
```
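
For reference, once the patch has been applied, the rendered machine configuration should contain sections roughly equivalent to the following (a hedged illustration of what the JSON patch above produces; you don't need to add this yourself):

```yaml
# Approximate result of gpu.yaml inside the rendered Talos machine config
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```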

## Testing

Run the following commands and check that the outputs shown below appear in your own output:

### Check the Modules

```bash
talosctl read /proc/modules
```

Output (the numbers and hex values may be different):

```bash
nvidia_uvm 1884160 - - Live 0xffffffffc3f79000 (PO)
nvidia_drm 94208 - - Live 0xffffffffc42f6000 (PO)
nvidia_modeset 1531904 - - Live 0xffffffffc4159000 (PO)
nvidia 62754816 - - Live 0xffffffffc039e000 (PO)
```

### Check the Extensions

```bash
talosctl get extensions
```

Output (the IP address and versions may differ):

```bash
192.168.178.9 runtime ExtensionStatus 3 1 nonfree-kmod-nvidia-lts 535.183.06-v1.8.4
192.168.178.9 runtime ExtensionStatus 4 1 nvidia-container-toolkit-lts 535.183.06-v1.16.1
```

### Read Driver Version

```bash
talosctl read /proc/driver/nvidia/version
```

Output (the driver version and build details may differ):

```bash
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.183.06 Wed Jun 26 06:46:07 UTC 2024
GCC version: gcc version 13.3.0 (GCC)
```

### Testing the GPU

```bash
kubectl run \
nvidia-test \
--restart=Never \
-ti --rm \
--image nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 \
--overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
nvidia-smi
```

Output (the warning about PodSecurity can be ignored):

```bash
Mon Dec 16 12:54:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1660 ... On | 00000000:00:0A.0 Off | N/A |
| 24% 26C P8 5W / 125W | 1MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
pod "nvidia-test" deleted
```
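
The `--overrides` flag runs the test pod with the `nvidia` RuntimeClass. Depending on your setup this RuntimeClass may already exist; if the pod fails because the runtime class `nvidia` cannot be found, it can be created with a minimal manifest like the following sketch (based on the upstream Talos NVIDIA documentation):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```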

## Nvidia-Device-Plugin

If all of the previous tests were successful, your GPU is ready to be used with the NVIDIA Device Plugin.
An example `helm-release.yaml` can be seen below:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: nvidia-device-plugin
  namespace: nvidia-device-plugin
spec:
  interval: 5m
  chart:
    spec:
      # renovate: registryUrl=https://charts.truechartsoci.org
      chart: nvidia-device-plugin
      version: 0.17.0
      sourceRef:
        kind: HelmRepository
        name: nvdp
        namespace: flux-system
      interval: 5m
  install:
    createNamespace: true
    crds: CreateReplace
    remediation:
      retries: 3
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 3
  values:
    image:
      repository: nvcr.io/nvidia/k8s-device-plugin
      tag: v0.17.0
    runtimeClassName: nvidia
    config:
      map:
        default: |-
          version: v1
          flags:
            migStrategy: none
          sharing:
            timeSlicing:
              renameByDefault: false
              failRequestsGreaterThanOne: false
              resources:
                - name: nvidia.com/gpu
                  replicas: 5
    gfd:
      enabled: true
```
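
Once Flux has reconciled this release (together with the Helm repository shown below), you can check that the plugin is healthy; a hedged sketch using standard `flux` and `kubectl` commands, assuming the namespace from the example above:

```bash
# Check the HelmRelease status
flux -n nvidia-device-plugin get helmreleases

# Check that the device plugin pods are running on the GPU node
kubectl -n nvidia-device-plugin get pods -o wide
```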

Don't forget to add the required repository file `nvdp.yaml` to the `repositories/helm` folder and to reference it in the corresponding `kustomization.yaml`:

```yaml
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: nvdp
  namespace: flux-system
spec:
  interval: 1h
  url: https://nvidia.github.io/k8s-device-plugin
  timeout: 3m
```
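
The exact layout depends on your repository, but the `kustomization.yaml` entry usually just lists the new file as an additional resource; a hedged example assuming `nvdp.yaml` lives next to it in `repositories/helm`:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # ...existing repository files...
  - nvdp.yaml
```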

## Example of GPU Assignment

The following shows an example of how to add the GPU to a chart. Depending on the chart, you may need to adapt the workload name.

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
workload:
  main:
    podSpec:
      runtimeClassName: "nvidia"
```

If you followed this guide, the GPU can be assigned to up to 5 different charts, matching the `replicas: 5` time-slicing setting above.
The limit of `1` always stays the same and is not increased for a second chart that uses the GPU.
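
You can verify the time-slicing configuration on the node itself: with the example `replicas: 5` setting, the GPU node should advertise five `nvidia.com/gpu` resources while each chart keeps requesting `1`. A hedged check using `kubectl`:

```bash
# Capacity and Allocatable on the GPU node should both report 5
# with the example time-slicing configuration
kubectl describe nodes | grep "nvidia.com/gpu"
```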

## Troubleshooting

If all the extensions, modules, and the driver version are present, but the GPU test shows something similar to:

```bash
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program from existing programs: unable to create new device filters program: load program: invalid argument: last insn is not an exit or jmp
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
```

then the patch wasn't added properly. This can be fixed by manually applying the patch with the following command:

```bash
talosctl patch mc --patch @gpu.yaml
```

This should fix the error, and re-running the test should display the desired output.
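
After applying the patch, you can also confirm that the sysctl from `gpu.yaml` is active on the node; the value should be `1`:

```bash
# Reads the sysctl directly from the Talos node; expected output: 1
talosctl read /proc/sys/net/core/bpf_jit_harden
```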
26 changes: 26 additions & 0 deletions website/src/content/docs/guides/talos/staff.md
---
title: Staff Cluster
sidebar:
order: 1
---

:::caution[Charts]

These clusters are personal and may contain content that isn't covered by support, or charts from different sources.

:::

Below is a list of the personal clusters of our staff members. These can be used as a reference and/or a template when adding your own Helm Charts to your cluster.

## Alfi0812

- [Talos-Cluster](https://github.com/alfi0812/talos-cluster)
- [Test Cluster](https://github.com/alfi0812/test-cluster)

## kqmaverick

- [Cluster](https://github.com/kqmaverick/cluster)

## PrivatePuffin

- [Cluster](https://github.com/PrivatePuffin/cluster)