diff --git a/doc/.custom_wordlist.txt b/doc/.custom_wordlist.txt
index 49298c1b319d..bb140b499b4d 100644
--- a/doc/.custom_wordlist.txt
+++ b/doc/.custom_wordlist.txt
@@ -3,6 +3,7 @@ ABI
 ACL
 ACLs
 AGPL
+AGX
 AIO
 allocator
 AMD
@@ -23,6 +24,7 @@ BPF
 Btrfs
 bugfix
 bugfixes
+CDI
 CentOS
 Ceph
 CephFS
@@ -43,6 +45,7 @@ cron
 CSV
 CUDA
 dataset
+dGPU
 DCO
 dereferenced
 DHCP
@@ -92,6 +95,8 @@ IdP
 idmap
 idmapped
 idmaps
+iGPU
+IGX
 incrementing
 InfiniBand
 init
@@ -102,6 +107,7 @@ IPAM
 IPs
 IPv
 IPVLAN
+Jetson
 JIT
 jq
 kB
@@ -147,6 +153,8 @@ NICs
 NUMA
 NVMe
 NVRAM
+NVIDIA
+OCI
 OData
 OIDC
 OpenFGA
@@ -160,6 +168,7 @@ OverlayFS
 OVMF
 OVN
 OVS
+passthrough
 Pbit
 PCI
 PCIe
@@ -199,6 +208,7 @@ runtime
 SATA
 scalable
 scriptlet
+SDK
 SDN
 SDS
 SDT
@@ -214,6 +224,7 @@ SKBPRIO
 SLAAC
 SMTP
 Snapcraft
+SoC
 Solaris
 SPAs
 SPL
@@ -247,6 +258,8 @@ sysfs
 syslog
 Tbit
 TCP
+TensorRT
+Tegra
 TiB
 Tibit
 TinyPNG
diff --git a/doc/reference/devices_gpu.md b/doc/reference/devices_gpu.md
index 26058b60e547..40cc82e03124 100644
--- a/doc/reference/devices_gpu.md
+++ b/doc/reference/devices_gpu.md
@@ -51,8 +51,96 @@ Add a specific GPU from the host system as a `physical` GPU device to an instanc

     lxc config device add <instance_name> <device_name> gpu gputype=physical pci=<pci_address>

+Add a specific GPU from the host system as a `physical` GPU device to an instance using the [Container Device Interface](https://github.com/cncf-tags/container-device-interface) (CDI) notation:
+
+    lxc config device add <instance_name> <device_name> gpu gputype=physical id=<cdi_identifier>
+
 See {ref}`instances-configure-devices` for more information.

+#### Passing an NVIDIA iGPU to a container
+
+Adding a device with the CDI notation is particularly useful if you have NVIDIA runtime libraries and configuration files installed on your host and you want to pass these files to your container. Take iGPU passthrough as an example:
+
+Your host is an NVIDIA single-board computer with a Tegra SoC that includes an iGPU. You also have an SDK installed on the host that gives you access to many useful libraries for handling AI workloads. You want to create a LXD container and run an inference job inside it, using the iGPU as a backend. You also want the inference job to run inside a Docker container (or any other OCI-compliant runtime). You could do something like this:
+
+Initialize a LXD container:
+
+    lxc init ubuntu:24.04 t1 --config security.nested=true --config security.privileged=true
+
+Add an iGPU device to your container:
+
+    lxc config device add t1 igpu0 gpu gputype=physical id=nvidia.com/gpu=igpu0
+
+Use a `cloud-init` script to install the Docker runtime, the [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit), and a script to run a test [TensorRT](https://github.com/NVIDIA/TensorRT) workload:
+
+```yaml
+#cloud-config
+package_update: true
+packages:
+  - docker.io
+write_files:
+  - path: /etc/docker/daemon.json
+    permissions: '0644'
+    owner: root:root
+    content: |
+      {
+        "max-concurrent-downloads": 12,
+        "max-concurrent-uploads": 12,
+        "runtimes": {
+          "nvidia": {
+            "args": [],
+            "path": "nvidia-container-runtime"
+          }
+        }
+      }
+  - path: /root/run_tensorrt.sh
+    permissions: '0755'
+    owner: root:root
+    content: |
+      #!/bin/bash
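+      # Executed later inside the TensorRT Docker container: print the OS release and
+      # kernel version, dump GPU details, then build and run the sample_onnx_mnist sample.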
+      echo "OS release,Kernel version"
+      (. /etc/os-release; echo "${PRETTY_NAME}"; uname -r) | paste -s -d,
+      echo
+      nvidia-smi -q
+      echo
+      exec bash -o pipefail -c "
+      cd /workspace/tensorrt/samples
+      make -j4
+      cd /workspace/tensorrt/bin
+      ./sample_onnx_mnist
+      retstatus=\${PIPESTATUS[0]}
+      echo \"Test exited with status code: \${retstatus}\" >&2
+      exit \${retstatus}
+      "
+runcmd:
+  - systemctl start docker
+  - systemctl enable docker
+  - usermod -aG docker root
+  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+  - apt-get update
+  - DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-container-toolkit
+  - nvidia-ctk runtime configure
+  - systemctl restart docker
+```
+
+Save this configuration as `cloud-init.yml` and apply it to your instance:
+
+    lxc config set t1 cloud-init.user-data - < cloud-init.yml
+
+Now you can start the instance:
+
+    lxc start t1
+
+Wait for the `cloud-init` process to finish:
+
+    lxc exec t1 -- cloud-init status --wait
+
+Finally, you can run your inference job inside the LXD container. Note: remember to set the `mode` option of the NVIDIA Container Runtime inside the LXD container to `csv` (and not `auto`) to let Docker know that the NVIDIA runtime must be enabled in CSV mode. This option is located in `/etc/nvidia-container-runtime/config.toml`:
+
+    lxc shell t1
+    root@t1 # docker run --gpus all --runtime nvidia --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3-igpu bash /sh_input/run_tensorrt.sh
+
 (gpu-mdev)=
 ## `gputype`: `mdev`