diff --git a/website/src/content/docs/guides/talos/gpu.md b/website/src/content/docs/guides/talos/gpu.md
new file mode 100644
index 000000000000..94840e2b7818
--- /dev/null
+++ b/website/src/content/docs/guides/talos/gpu.md
@@ -0,0 +1,253 @@
+---
+title: Adding an Nvidia GPU to your Cluster
+---
+
+:::caution[Charts]
+
+Adding a GPU to your cluster isn't covered by the Support Policy.
+Feel free to open a thread in the appropriate channel on our Discord server.
+
+:::
+
+## Prerequisites
+
+- Your GPU is isolated
+- The GPU is passed through to your Talos machine
+
+## Extensions
+
+:::caution[Charts]
+
+This guide assumes you are using clustertool for your Talos cluster. The steps may differ otherwise.
+
+:::
+
+It's important to add the following extensions to your `talconfig.yaml` before bootstrapping:
+
+```yaml
+schematic:
+  customization:
+    systemExtensions:
+      officialExtensions:
+        - siderolabs/nonfree-kmod-nvidia-lts
+        - siderolabs/nvidia-container-toolkit-lts
+```
+
+## GPU Talos Patch
+
+Additionally, you will need to create the following patch file `gpu.yaml` in the patches folder of clustertool:
+
+```yaml
+- op: add
+  path: /machine/kernel
+  value:
+    modules:
+      - "name": "nvidia"
+      - "name": "nvidia_uvm"
+      - "name": "nvidia_drm"
+      - "name": "nvidia_modeset"
+- op: add
+  path: /machine/sysctls
+  value:
+    "net.core.bpf_jit_harden": 1
+```
+
+This patch file then needs to be added to the `talconfig.yaml`:
+
+```yaml
+patches:
+  - "@./patches/gpu.yaml"
+```
+
+## Testing
+
+Run the following commands and check that the shown outputs are included in your command output:
+
+### Check the Modules
+
+```bash
+talosctl read /proc/modules
+```
+
+Output (the sizes and addresses may differ):
+
+```bash
+nvidia_uvm 1884160 - - Live 0xffffffffc3f79000 (PO)
+nvidia_drm 94208 - - Live 0xffffffffc42f6000 (PO)
+nvidia_modeset 1531904 - - Live 0xffffffffc4159000 (PO)
+nvidia 62754816 - - Live 0xffffffffc039e000 (PO)
+```
+
+### Check the Extensions
+
+```bash
+talosctl get extensions
+```
+
+Output (the IP address and versions may differ):
+
+```bash
+192.168.178.9   runtime   ExtensionStatus   3   1   nonfree-kmod-nvidia-lts        535.183.06-v1.8.4
+192.168.178.9   runtime   ExtensionStatus   4   1   nvidia-container-toolkit-lts   535.183.06-v1.16.1
+```
+
+### Read Driver Version
+
+```bash
+talosctl read /proc/driver/nvidia/version
+```
+
+Output (the driver and GCC versions may differ):
+
+```bash
+NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.183.06  Wed Jun 26 06:46:07 UTC 2024
+GCC version:  gcc version 13.3.0 (GCC)
+```
+
+### Testing the GPU
+
+```bash
+kubectl run \
+  nvidia-test \
+  --restart=Never \
+  -ti --rm \
+  --image nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 \
+  --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
+  nvidia-smi
+```
+
+Output (the warning about PodSecurity can be ignored):
+
+```bash
+Mon Dec 16 12:54:35 2024
++---------------------------------------------------------------------------------------+
+| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
+|-----------------------------------------+----------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
+|                                         |                      |               MIG M. |
+|=========================================+======================+======================|
+|   0  NVIDIA GeForce GTX 1660 ...    On  | 00000000:00:0A.0 Off |                  N/A |
+| 24%   26C    P8               5W / 125W |      1MiB /  6144MiB |      0%      Default |
+|                                         |                      |                  N/A |
++-----------------------------------------+----------------------+----------------------+
+
++---------------------------------------------------------------------------------------+
+| Processes:                                                                             |
+|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
+|        ID   ID                                                             Usage      |
+|=======================================================================================|
+|  No running processes found                                                           |
++---------------------------------------------------------------------------------------+
+pod "nvidia-test" deleted
+```
+
+## Nvidia-Device-Plugin
+
+If all of the previous tests were successful, your GPU is ready to be used with the Nvidia Device Plugin.
+An example `helm-release.yaml` can be seen below:
+
+```yaml
+apiVersion: helm.toolkit.fluxcd.io/v2
+kind: HelmRelease
+metadata:
+  name: nvidia-device-plugin
+  namespace: nvidia-device-plugin
+spec:
+  interval: 5m
+  chart:
+    spec:
+      # renovate: registryUrl=https://charts.truechartsoci.org
+      chart: nvidia-device-plugin
+      version: 0.17.0
+      sourceRef:
+        kind: HelmRepository
+        name: nvdp
+        namespace: flux-system
+      interval: 5m
+  install:
+    createNamespace: true
+    crds: CreateReplace
+    remediation:
+      retries: 3
+  upgrade:
+    crds: CreateReplace
+    remediation:
+      retries: 3
+  values:
+    image:
+      repository: nvcr.io/nvidia/k8s-device-plugin
+      tag: v0.17.0
+    runtimeClassName: nvidia
+    config:
+      map:
+        default: |-
+          version: v1
+          flags:
+            migStrategy: none
+          sharing:
+            timeSlicing:
+              renameByDefault: false
+              failRequestsGreaterThanOne: false
+              resources:
+                - name: nvidia.com/gpu
+                  replicas: 5
+    gfd:
+      enabled: true
+```
+
+Don't forget to add the required repository `nvdp.yaml` to the `repositories/helm` folder and to add it to the corresponding `kustomization.yaml`:
+
+```yaml
+---
+apiVersion: source.toolkit.fluxcd.io/v1
+kind: HelmRepository
+metadata:
+  name: nvdp
+  namespace: flux-system
+spec:
+  interval: 1h
+  url: https://nvidia.github.io/k8s-device-plugin
+  timeout: 3m
+```
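+
+As a rough sketch, the matching entry in `kustomization.yaml` could look like the example below. The exact file location depends on how your clustertool repository is laid out, so treat the path as an assumption rather than a fixed requirement:
+
+```yaml
+# Example only: repositories/helm/kustomization.yaml (path assumed, adjust to your layout)
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+resources:
+  # Reference the HelmRepository manifest created above
+  - nvdp.yaml
+```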
+
+## Example of GPU Assignment
+
+The following shows an example of how to add the GPU to a chart. Depending on the chart, you may need to adapt the workload name.
+
+```yaml
+resources:
+  limits:
+    nvidia.com/gpu: 1
+workload:
+  main:
+    podSpec:
+      runtimeClassName: "nvidia"
+```
+
+If you followed this guide, the GPU can be assigned to up to 5 different charts, because the time-slicing configuration above (`replicas: 5`) makes the device plugin advertise 5 shareable GPU slots.
+The number `1` always stays the same and won't be increased for a second chart that uses the GPU.
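+
+To check how many of these shareable GPU slots the node actually advertises, you can inspect the node's resources. The node name below is a placeholder:
+
+```bash
+# Replace <node-name> with the node that has the GPU attached
+kubectl describe node <node-name> | grep "nvidia.com/gpu"
+```
+
+With the time-slicing configuration from this guide, Capacity and Allocatable should report `5` for `nvidia.com/gpu`.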
+
+## Troubleshooting
+
+If all the extensions and modules are present and the driver version is shown, but the GPU test shows something similar to:
+
+```bash
+failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
+nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program from existing programs: unable to create new device filters program: load program: invalid argument: last insn is not an exit or jmp
+processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
+```
+
+then the patch wasn't added properly.
+This can be fixed by manually applying the patch with the following command:
+
+```bash
+talosctl patch mc --patch @gpu.yaml
+```
+
+This should fix the error, and the GPU test should now display the desired output.
diff --git a/website/src/content/docs/guides/talos/staff.md b/website/src/content/docs/guides/talos/staff.md
new file mode 100644
index 000000000000..2bd1b93b85ef
--- /dev/null
+++ b/website/src/content/docs/guides/talos/staff.md
@@ -0,0 +1,26 @@
+---
+title: Staff Cluster
+sidebar:
+  order: 1
+---
+
+:::caution[Charts]
+
+These clusters are personal and may contain content not covered by support, or charts from different sources.
+
+:::
+
+A list of personal clusters of our staff members. These can be used as a reference and/or template when adding your own Helm charts to the cluster.
+
+## Alfi0812
+
+- [Talos-Cluster](https://github.com/alfi0812/talos-cluster)
+- [Test Cluster](https://github.com/alfi0812/test-cluster)
+
+## kqmaverick
+
+- [Cluster](https://github.com/kqmaverick/cluster)
+
+## PrivatePuffin
+
+- [Cluster](https://github.com/PrivatePuffin/cluster)
diff --git a/website/src/content/docs/guides/talos.md b/website/src/content/docs/guides/talos/talos.md
similarity index 100%
rename from website/src/content/docs/guides/talos.md
rename to website/src/content/docs/guides/talos/talos.md