
gpu: Support GPU passthrough to LXD containers using Container Device Interface (CDI) #13562

Merged

Conversation


@gabrielmougard (Contributor) commented Jun 6, 2024

Abstract

Don't rely exclusively on nvidia-container-cli in LXC to configure GPU passthrough. As that tool is being deprecated and won't work with the NVIDIA Tegra iGPU, we use the nvidia-container-toolkit tools to generate a source of truth instructing LXD on what to pass (card devices, non-card devices and runtime libraries) into the container namespace.

Rationale

The reason behind this change is the need to support NVIDIA iGPU passthrough for LXD containers (e.g. Tegra SoCs such as the AGX/IGX Orin boards). The current implementation of NVIDIA GPU passthrough for LXD containers is based on the nvidia-container-cli tool, which is not compatible with iGPU passthrough: nvidia-container-cli and other tools such as nvidia-smi rely on NVML (NVIDIA Management Library), which reads PCI / PCI-X information in order to work properly. An iGPU lives on an AMBA bus rather than a PCI / PCI-X bus, so this information is not available and the device cannot be handled through NVML.

Specification

A novel approach could consist of leveraging the recent effort from NVIDIA to provide a more generic and flexible tool called nvidia-container-toolkit. This project focuses on:

  1. Generating, validating and managing a platform configuration file to be integrated with an OCI runtime. This file describes which device nodes need to be mounted inside the container to provide the necessary access to the GPU, alongside the symlinks to the NVIDIA drivers and libraries. The file (JSON format) is meant to be 'merged' with the OCI image definition (also in JSON format) to provide a complete description of the container runtime environment. The standard uses of this CDI definition are described in the CDI specification. (A minimal sketch of such a file's structure is shown after this list.)

  2. Providing a runtime shim to allow GPU device passthrough in OCI-based container manager solutions.
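For illustration, here is an abridged, hand-written example of such a CDI document together with minimal Go structs for the fields this design cares about. It is a sketch based on the public CDI schema, not the output of any real generator; the concrete paths, versions and values are made up.

// Sketch only: a hand-written, abridged CDI spec and the minimal Go model
// needed to walk its device nodes and mounts. Field names follow the public
// CDI schema; the concrete values are illustrative assumptions.
package main

import (
	"encoding/json"
	"fmt"
)

type cdiSpec struct {
	CDIVersion string      `json:"cdiVersion"`
	Kind       string      `json:"kind"`
	Devices    []cdiDevice `json:"devices"`
}

type cdiDevice struct {
	Name           string         `json:"name"`
	ContainerEdits containerEdits `json:"containerEdits"`
}

type containerEdits struct {
	DeviceNodes []deviceNode `json:"deviceNodes"`
	Mounts      []mountEntry `json:"mounts"`
}

type deviceNode struct {
	Path  string `json:"path"`
	Major int64  `json:"major"`
	Minor int64  `json:"minor"`
}

type mountEntry struct {
	HostPath      string   `json:"hostPath"`
	ContainerPath string   `json:"containerPath"`
	Options       []string `json:"options"`
}

const example = `{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "deviceNodes": [{"path": "/dev/nvidia0", "major": 195, "minor": 0}],
        "mounts": [{"hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1", "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1", "options": ["ro", "bind"]}]
      }
    }
  ]
}`

func main() {
	var spec cdiSpec
	if err := json.Unmarshal([]byte(example), &spec); err != nil {
		panic(err)
	}

	for _, dev := range spec.Devices {
		fmt.Printf("device %s=%s\n", spec.Kind, dev.Name)
		for _, n := range dev.ContainerEdits.DeviceNodes {
			fmt.Printf("  device node %s (%d:%d)\n", n.Path, n.Major, n.Minor)
		}
		for _, m := range dev.ContainerEdits.Mounts {
			fmt.Printf("  mount %s -> %s %v\n", m.HostPath, m.ContainerPath, m.Options)
		}
	}
}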

One might wonder how this could be useful, as LXD does not follow the OCI specification.
The idea is to reuse the generation, validation and management logic of a platform-specific configuration file (which follows the CDI specification) but to adapt the mounting logic so that it can be used in an LXC context.

Instead of merging this CDI representation with an OCI image definition (which we don't have anyway), we'll read the device node entries and translate them into lxc.mount.entry elements. We might also have to add the associated lxc.cgroup*.devices.allow entries to allow the container to access the GPU device nodes.
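As a rough illustration of that translation, here is a hedged sketch under assumptions (the helper name and the exact lxc.mount.entry format are made up here, not taken from LXD's implementation) showing how a single device node could be turned into config lines:

// Sketch: stat a host device node and derive an lxc.mount.entry line plus the
// matching lxc.cgroup2.devices.allow rule. Entry formats are illustrative.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func lxcEntriesForDeviceNode(hostPath string) (string, string, error) {
	stat := unix.Stat_t{}
	err := unix.Stat(hostPath, &stat)
	if err != nil {
		return "", "", err
	}

	major := unix.Major(uint64(stat.Rdev))
	minor := unix.Minor(uint64(stat.Rdev))

	// Bind-mount the host node onto the same (rootfs-relative) path in the container.
	mountEntry := fmt.Sprintf("%s %s none bind,create=file 0 0", hostPath, hostPath[1:])
	// Allow read/write/mknod on this character device from inside the container.
	cgroupAllow := fmt.Sprintf("c %d:%d rwm", major, minor)
	return mountEntry, cgroupAllow, nil
}

func main() {
	mnt, allow, err := lxcEntriesForDeviceNode("/dev/nvidia0")
	if err != nil {
		panic(err)
	}
	fmt.Println("lxc.mount.entry =", mnt)
	fmt.Println("lxc.cgroup2.devices.allow =", allow)
}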

Implementation details

Allow the id field (for gputype=physical and gputype=mig) to accept a CDI identifier

The id GPU config option is meant to receive the DRM card ID of the parent GPU device. We'll augment this option to also accept the CDI identifier of the GPU device. In the same fashion as OCI device passing, this means that id could be, for example, {VENDOR_DOMAIN_NAME}/gpu=gpu{INDEX} for each full (non-MIG-enabled, if the vendor is NVIDIA) GPU in the system. For MIG-enabled GPUs, the id would be {VENDOR_DOMAIN_NAME}/gpu=mig{GPU_INDEX}:{MIG_INDEX}. We'll also have {VENDOR_DOMAIN_NAME}/gpu=all, which will potentially create multiple LXD GPU devices in the container, one for each GPU in the system.

Having a CDI identifier like this gives us a way to target an iGPU that does not live on a PCIe bus (so no pci config option is possible).

For the iGPU, we will not introduce a new gputype; we'll keep physical, except that the id will have to be a CDI identifier (because there is no PCI address to map to the device).

This approach is ideal for the end user because we neither introduce a new gputype for the iGPU nor change the config options of the other gputypes. The id option simply becomes more flexible and accepts either a DRM card ID or a CDI identifier (in the latter case, it overrides vendorid, productid, pci and even mig.* if the card is MIG-enabled). The rest of the machinery is hidden from the user.
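Since the id option now has to tell a DRM card ID apart from a CDI identifier, the validation could boil down to pattern matching. The sketch below is an assumption, not LXD's actual validation code; the accepted name patterns are inferred from the formats above and from the igpu0 example later in this thread.

// Sketch: classify an `id` value as a DRM card ID, a CDI identifier or invalid.
package main

import (
	"fmt"
	"regexp"
)

var (
	// DRM card IDs are plain integers, e.g. "0" or "1".
	drmCardID = regexp.MustCompile(`^[0-9]+$`)
	// CDI identifiers such as "nvidia.com/gpu=gpu0", "nvidia.com/gpu=mig0:1" or "nvidia.com/gpu=all".
	cdiID = regexp.MustCompile(`^[a-z0-9.-]+/gpu=(all|gpu[0-9]+|igpu[0-9]+|mig[0-9]+:[0-9]+)$`)
)

func classifyGPUID(id string) string {
	switch {
	case drmCardID.MatchString(id):
		return "drm"
	case cdiID.MatchString(id):
		return "cdi"
	default:
		return "invalid"
	}
}

func main() {
	for _, id := range []string{"1", "nvidia.com/gpu=gpu0", "nvidia.com/gpu=mig0:1", "nvidia.com/gpu=all", "bogus"} {
		fmt.Println(id, "->", classifyGPUID(id))
	}
}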

With this change in mind, we drew up the following development roadmap:

  1. Augmenting gputype=physical: the id config option needs to be augmented to support a CDI identifier. We need more validation rules. Handle the all case.

  2. Augmenting gputype=mig: the id config option needs to be augmented to support a CDI identifier, same as above. Handle the all case.

  3. CDI spec generation: we need to generate the CDI spec each time we start the gpu device (in the startContainer function).

  4. CDI spec translation: if a CDI spec has been successfully generated and validated against the GPU device the user has requested, we need to translate this spec into actionable LXC mount entries plus cgroup permissions. The hooks need to be adapted to our format.

  5. LXC driver: detect if a GPU device has been started with CDI. If so, redirect the LXC hook from being hooks/nvidia to being cdi-hook.

  6. Creation of the cdi-hook binary in the LXD project. This binary will be responsible for executing all the hooks (pivoting to the container rootfs, then updating the ldcache and creating the symlinks). Its source code should live at the top level of the project structure (like other LXD tools). That way, we (Canonical) really own this hook and do not force the Linux Containers project to maintain it as part of their hooks folder. When CDI is used, we simply execute cdi-hook instead of LXC's hooks/nvidia hook.

  7. Adapt snap packaging to include the cdi-hook binary.

  8. Real-hardware testing: we can run a CUDA workload inside a container to check that the devices and the runtime libraries have been passed through.

API changes

No API changes expected.

CLI changes

No CLI changes expected.

Database changes

No database changes expected.


TODO:

  • PoC tool
  • snap package + integration test with dGPU + iGPU

Fixes #12525

@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch from 6b149e2 to 6a371b5 Compare June 6, 2024 16:47
@gabrielmougard gabrielmougard self-assigned this Jun 6, 2024
@github-actions github-actions bot added the Documentation Documentation needs updating label Jun 6, 2024

github-actions bot commented Jun 6, 2024

Heads up @mionaalex - the "Documentation" label was applied to this issue.

@tomponline (Member) commented:

@gabrielmougard as discussed in the 1:1, let's try to use the existing unix, disk and raw.lxc settings to confirm the theory that this allows iGPU cards to be used in the container, and if so we can move on to discussing what the user experience and implementation will be. Thanks

@elezar commented Jun 11, 2024

/cc @elezar

@zvonkok commented Jun 11, 2024

/cc @zvonkok

@gabrielmougard (Contributor, Author) commented:

@tomponline this current implementation works for dGPU / iGPU passthrough (documentation in progress) with Docker nested inside an LXD container (Docker inside requires security.nesting=true and security.privileged=true). We can use this cloud-init script cloud-init.yaml to test it:

#cloud-config
package_update: true
packages:
  - docker.io
write_files:
  - path: /etc/docker/daemon.json
    permissions: '0644'
    owner: root:root
    content: |
      {
        "max-concurrent-downloads": 12,
        "max-concurrent-uploads": 12, 
        "runtimes": {
          "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
          }
        }
      }
  - path: /root/run_tensorrt.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      echo "OS release,Kernel version"
      (. /etc/os-release; echo "${PRETTY_NAME}"; uname -r) | paste -s -d,
      echo
      nvidia-smi -q
      echo
      exec bash -o pipefail -c "
      cd /workspace/tensorrt/samples
      make -j4
      cd /workspace/tensorrt/bin
      ./sample_onnx_mnist
      retstatus=\${PIPESTATUS[0]}
      echo \"Test exited with status code: \${retstatus}\" >&2
      exit \${retstatus}
      "
runcmd:
  - systemctl start docker
  - systemctl enable docker
  - usermod -aG docker root
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure
  - systemctl restart docker

Then create the instance with the GPU

lxc init ubuntu:jammy t1 --config security.nesting=true --config security.privileged=true
lxc config set t1 cloud-init.user-data - < cloud-init.yml

# If the machine has a dGPU
lxc config device add t1 dgpu0 gpu gputype=physical id=nvidia.com/gpu=gpu0

# Or, if the machine has an iGPU
lxc config device add t1 igpu0 gpu gputype=physical id=nvidia.com/gpu=igpu0

lxc start t1

lxc shell t1
root@t1 # docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3 bash /sh_input/run_tensorrt.sh

If you passed an iGPU, once inside the container edit /etc/nvidia-container-runtime/config.toml to use mode=csv instead of mode=auto, then execute the docker command:

docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3-igpu bash /sh_input/run_tensorrt.sh

@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch 2 times, most recently from 3f436ad to f5e527c Compare June 21, 2024 09:39
@gabrielmougard gabrielmougard requested review from ru-fu and elezar June 21, 2024 09:39
@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch 2 times, most recently from 6ef0893 to 42d7b5b Compare June 21, 2024 11:28
@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch from 42d7b5b to d4c2b57 Compare June 21, 2024 14:53
@tomponline (Member) commented:

@gabrielmougard can you also check that first commit, as there is a persistent failure downloading the Go deps in the tests; it might need to be refreshed so that it reflects the current state of main

The new `github.com/NVIDIA/nvidia-container-toolkit` dependency is required because we need the `nvcdi` package
in order to generate an NVIDIA implementation of a CDI specification. All of the device / mount discovery logic
is encapsulated in this library.

Signed-off-by: Gabriel Mougard <[email protected]>
In order to log the internal discovery operations of the CDI library,
we created a `CDILogger` type that reuses LXD's existing shared logger,
but with slightly modified method prototypes to comply with the CDI logger interface.

Signed-off-by: Gabriel Mougard <[email protected]>
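As an illustration of that adapter, the shape of a CDILogger could be roughly as follows. The method set and the underlying logger are assumptions for the sketch, not taken from the actual commit or from the CDI library's real interface.

// Sketch: adapt a standard library logger to a hypothetical CDI-style logging
// interface by prefixing the level in each method.
package main

import "log"

// cdiLogger stands in for the interface the CDI library is assumed to expect.
type cdiLogger interface {
	Debugf(format string, args ...any)
	Infof(format string, args ...any)
	Warningf(format string, args ...any)
	Errorf(format string, args ...any)
}

// CDILogger wraps a *log.Logger, mirroring how a shared LXD logger could be reused.
type CDILogger struct{ l *log.Logger }

func (c *CDILogger) Debugf(f string, a ...any)   { c.l.Printf("DEBUG: "+f, a...) }
func (c *CDILogger) Infof(f string, a ...any)    { c.l.Printf("INFO: "+f, a...) }
func (c *CDILogger) Warningf(f string, a ...any) { c.l.Printf("WARN: "+f, a...) }
func (c *CDILogger) Errorf(f string, a ...any)   { c.l.Printf("ERROR: "+f, a...) }

func main() {
	var lg cdiLogger = &CDILogger{l: log.Default()}
	lg.Infof("generating CDI spec for %s", "nvidia.com/gpu=gpu0")
}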
…pport CDI naming for `physical` gputype

Signed-off-by: Gabriel Mougard <[email protected]>
…ntns` callhook if CDI devices have been configured

Signed-off-by: Gabriel Mougard <[email protected]>
…mount` for executing the CDI hooks

Signed-off-by: Gabriel Mougard <[email protected]>
In a select statement, only one case is executed when a channel operation succeeds, so once a case has been selected and its associated code block has run, the select statement terminates automatically (there is no fall-through behaviour as in a switch statement). When the response case is selected (i.e. a value is received from the channel), the code inside that case runs and the select then completes naturally. The break statement here doesn't break out of any enclosing loop or switch; it only breaks out of the select, which would happen anyway.

Signed-off-by: Gabriel Mougard <[email protected]>
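A tiny self-contained Go example of the point made in that commit message (illustrative only):

// The break in the single select case below is redundant: the select ends as
// soon as the chosen case's body finishes.
package main

import "fmt"

func main() {
	response := make(chan string, 1)
	response <- "done"

	select {
	case msg := <-response:
		fmt.Println("received:", msg)
		break // redundant: control leaves the select here regardless
	}

	fmt.Println("after select") // reached either way
}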
Signed-off-by: Gabriel Mougard <[email protected]>
@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch from 5ac51bd to 2421c34 Compare August 28, 2024 08:51
@gabrielmougard (Contributor, Author) commented Aug 28, 2024

@tomponline updated. Although there seems to be a new issue with the clustering tests (that I also saw this morning with #13995)

@tomponline (Member) commented:

> @tomponline updated. Although there seems to be a new issue with the clustering tests (that I also saw this morning with #13995)

That seems related to the dqlite PPA; I have flagged it to the dqlite team, but it is not related to your PR. Thanks for flagging it though.

@tomponline (Member) left a comment:

Thanks!

@tomponline tomponline merged commit c2862fa into canonical:main Aug 28, 2024
27 of 28 checks passed
Comment on lines +29 to +38
if d.Major == 0 || d.Minor == 0 {
stat := unix.Stat_t{}
err := unix.Stat(hostPath, &stat)
if err != nil {
return err
}

d.Major = int64(unix.Major(uint64(stat.Rdev)))
d.Minor = int64(unix.Minor(uint64(stat.Rdev)))
}

As a follow-up question: Would it make sense for us to expose the fillMissingInfo logic that is used in CDI through a public API: https://github.com/cncf-tags/container-device-interface/blob/main/pkg/cdi/container-edits_unix.go#L60

@gabrielmougard (Contributor, Author) replied:

That could be a nice add yes :)

A Member replied:

Please can an issue be logged in GH to track this, @gabrielmougard? Thanks

Comment on lines +59 to +61
if mount.HostPath == "" || mount.ContainerPath == "" {
return nil, fmt.Errorf("The hostPath or containerPath is empty in the CDI mount: %v", *mount)
}

Note that strictly speaking it is possible to define a tmpfs which may have an empty HostPath and the type field set to "tmpfs". See https://github.com/opencontainers/runc/blob/346b818dad833a5ae8ab55d670b716dadd45950e/libcontainer/rootfs_linux.go#L521-L533

The source / host path is only required for a bind mount.
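A hedged sketch of that relaxation (not the merged LXD code; the cdiMount struct below only models the CDI fields used here) could look like this:

// Sketch: only require a host path for bind mounts, so a "tmpfs" CDI mount
// with an empty hostPath is accepted.
package main

import "fmt"

type cdiMount struct {
	HostPath      string
	ContainerPath string
	Type          string
}

func validateCDIMount(mount *cdiMount) error {
	if mount.ContainerPath == "" {
		return fmt.Errorf("The containerPath is empty in the CDI mount: %v", *mount)
	}

	// Only bind mounts need a source on the host.
	if mount.Type != "tmpfs" && mount.HostPath == "" {
		return fmt.Errorf("The hostPath is empty in the CDI mount: %v", *mount)
	}

	return nil
}

func main() {
	fmt.Println(validateCDIMount(&cdiMount{ContainerPath: "/tmp/scratch", Type: "tmpfs"})) // <nil>
	fmt.Println(validateCDIMount(&cdiMount{ContainerPath: "/usr/lib/libcuda.so.1"}))       // error: empty hostPath
}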

tomponline added a commit to canonical/lxd-ci that referenced this pull request Sep 3, 2024
hamistao pushed a commit to hamistao/lxd-ci that referenced this pull request Sep 6, 2024
tomponline added a commit that referenced this pull request Nov 21, 2024
Add an API extension for adding the CDI mode for GPU passthrough (see
#13562)