
gpu: Support GPU passthrough to LXD containers using Container Device Interface (CDI) #13562

Merged

Conversation


@gabrielmougard (Contributor) commented Jun 6, 2024

Abstract

Don't rely exclusively on nvidia-container-cli in LXC to configure GPU passthrough. As that tool is being deprecated and won't work with the NVIDIA Tegra iGPU, we use the nvidia-container-toolkit tools to generate a source of truth instructing LXD on what to pass (card devices, non-card devices and runtime libraries) into the container namespace.

Rationale

The reason behind this change is the need to support NVIDIA iGPU passthrough for LXD containers (e.g. Tegra SoCs such as the AGX/IGX Orin boards). The current implementation of NVIDIA GPU passthrough for LXD containers is based on the nvidia-container-cli tool, which is not compatible with iGPU passthrough: nvidia-container-cli and other tools such as nvidia-smi rely on NVML (NVIDIA Management Library), which reads PCI / PCI-X information in order to work properly. An iGPU lives on an AMBA bus rather than a PCI / PCI-X bus, so this information is not available and the device cannot be handled through NVML.

Specification

A novel approach could consist of leveraging the recent effort from NVIDIA to provide a more generic and flexible tool called nvidia-container-toolkit. This project focuses on:

  1. Generating, validating and managing a platform configuration file to be integrated with an OCI runtime. This file describes which device nodes need to be mounted inside the container to provide the necessary access to the GPU, alongside the symlinks to the NVIDIA drivers and libraries. The file (JSON format) is meant to be 'merged' with the OCI image definition (also in JSON format) to provide a complete description of the container runtime environment. The standard uses of this CDI definition are described in the CDI specification. (A minimal sketch of such a file's structure is shown after this list.)

  2. Providing a runtime shim to allow GPU device passthrough in OCI-based container manager solutions.
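For illustration, here is an abridged, hand-written example of such a CDI document together with minimal Go structs for the fields this design cares about. It is a sketch based on the public CDI schema, not the output of any real generator; the concrete paths, versions and values are made up.

// Sketch only: a hand-written, abridged CDI spec and the minimal Go model
// needed to walk its device nodes and mounts. Field names follow the public
// CDI schema; the concrete values are illustrative assumptions.
package main

import (
	"encoding/json"
	"fmt"
)

type cdiSpec struct {
	CDIVersion string      `json:"cdiVersion"`
	Kind       string      `json:"kind"`
	Devices    []cdiDevice `json:"devices"`
}

type cdiDevice struct {
	Name           string         `json:"name"`
	ContainerEdits containerEdits `json:"containerEdits"`
}

type containerEdits struct {
	DeviceNodes []deviceNode `json:"deviceNodes"`
	Mounts      []mountEntry `json:"mounts"`
}

type deviceNode struct {
	Path  string `json:"path"`
	Major int64  `json:"major"`
	Minor int64  `json:"minor"`
}

type mountEntry struct {
	HostPath      string   `json:"hostPath"`
	ContainerPath string   `json:"containerPath"`
	Options       []string `json:"options"`
}

const example = `{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "deviceNodes": [{"path": "/dev/nvidia0", "major": 195, "minor": 0}],
        "mounts": [{"hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1", "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1", "options": ["ro", "bind"]}]
      }
    }
  ]
}`

func main() {
	var spec cdiSpec
	if err := json.Unmarshal([]byte(example), &spec); err != nil {
		panic(err)
	}

	for _, dev := range spec.Devices {
		fmt.Printf("device %s=%s\n", spec.Kind, dev.Name)
		for _, n := range dev.ContainerEdits.DeviceNodes {
			fmt.Printf("  device node %s (%d:%d)\n", n.Path, n.Major, n.Minor)
		}
		for _, m := range dev.ContainerEdits.Mounts {
			fmt.Printf("  mount %s -> %s %v\n", m.HostPath, m.ContainerPath, m.Options)
		}
	}
}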

One might wonder how this could be useful, as LXD does not follow the OCI specification.
The idea is to reuse the generation, validation and management logic of a platform-specific configuration file (which follows the CDI specification) but to adapt the mounting logic so that it can be used in an LXC context.

Instead of merging this CDI representation with an OCI image definition (which we don't have anyway), we'll read the device node entries and translate them into lxc.mount.entry elements. We might also have to add the associated lxc.cgroup*.devices.allow entries to allow the container to access the GPU device nodes.
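As a rough illustration of that translation, here is a hedged sketch under assumptions (the helper name and the exact lxc.mount.entry format are made up here, not taken from LXD's implementation) showing how a single device node could be turned into config lines:

// Sketch: stat a host device node and derive an lxc.mount.entry line plus the
// matching lxc.cgroup2.devices.allow rule. Entry formats are illustrative.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func lxcEntriesForDeviceNode(hostPath string) (string, string, error) {
	stat := unix.Stat_t{}
	err := unix.Stat(hostPath, &stat)
	if err != nil {
		return "", "", err
	}

	major := unix.Major(uint64(stat.Rdev))
	minor := unix.Minor(uint64(stat.Rdev))

	// Bind-mount the host node onto the same (rootfs-relative) path in the container.
	mountEntry := fmt.Sprintf("%s %s none bind,create=file 0 0", hostPath, hostPath[1:])
	// Allow read/write/mknod on this character device from inside the container.
	cgroupAllow := fmt.Sprintf("c %d:%d rwm", major, minor)
	return mountEntry, cgroupAllow, nil
}

func main() {
	mnt, allow, err := lxcEntriesForDeviceNode("/dev/nvidia0")
	if err != nil {
		panic(err)
	}
	fmt.Println("lxc.mount.entry =", mnt)
	fmt.Println("lxc.cgroup2.devices.allow =", allow)
}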

Implementation details

Allow the id field (for gputype=physical and gputype=mig) to accept a CDI identifier

The id GPU config option is meant to receive the DRM card ID of the parent GPU device. We'll augment this option to also accept the CDI identifier of the GPU device. In the same fashion as OCI device passing, this means that id could be, for example, {VENDOR_DOMAIN_NAME}/gpu=gpu{INDEX} for each full (non-MIG-enabled, if the vendor is NVIDIA) GPU in the system. For MIG-enabled GPUs, the id would be {VENDOR_DOMAIN_NAME}/gpu=mig{GPU_INDEX}:{MIG_INDEX}. We'll also have {VENDOR_DOMAIN_NAME}/gpu=all, which will potentially create multiple LXD GPU devices in the container, one for each GPU in the system.

Having a CDI identifier like this gives us a way to target an iGPU that does not live on a PCIe bus (so no pci config option is possible).

For the iGPU, we will not introduce a new gputype; we'll keep physical, except that the id will have to be a CDI identifier (because there is no PCI address to map to the device).

This approach is ideal for the end user because we neither introduce a new gputype for the iGPU nor change the config options of the other gputypes. The id option simply becomes more flexible and accepts either a DRM card ID or a CDI identifier (in the latter case, it overrides vendorid, productid, pci and even mig.* if the card is MIG-enabled). The rest of the machinery is hidden from the user.
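Since the id option now has to tell a DRM card ID apart from a CDI identifier, the validation could boil down to pattern matching. The sketch below is an assumption, not LXD's actual validation code; the accepted name patterns are inferred from the formats above and from the igpu0 example later in this thread.

// Sketch: classify an `id` value as a DRM card ID, a CDI identifier or invalid.
package main

import (
	"fmt"
	"regexp"
)

var (
	// DRM card IDs are plain integers, e.g. "0" or "1".
	drmCardID = regexp.MustCompile(`^[0-9]+$`)
	// CDI identifiers such as "nvidia.com/gpu=gpu0", "nvidia.com/gpu=mig0:1" or "nvidia.com/gpu=all".
	cdiID = regexp.MustCompile(`^[a-z0-9.-]+/gpu=(all|gpu[0-9]+|igpu[0-9]+|mig[0-9]+:[0-9]+)$`)
)

func classifyGPUID(id string) string {
	switch {
	case drmCardID.MatchString(id):
		return "drm"
	case cdiID.MatchString(id):
		return "cdi"
	default:
		return "invalid"
	}
}

func main() {
	for _, id := range []string{"1", "nvidia.com/gpu=gpu0", "nvidia.com/gpu=mig0:1", "nvidia.com/gpu=all", "bogus"} {
		fmt.Println(id, "->", classifyGPUID(id))
	}
}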

With this change in mind, we drew up the following development roadmap:

  1. Augmenting gputype=physical: the id config option needs to be augmented to support a CDI identifier. We need more validation rules. Handle the all case.

  2. Augmenting gputype=mig: the id config option needs to be augmented to support a CDI identifier, same as above. Handle the all case.

  3. CDI spec generation: we need to generate the CDI spec each time we start the gpu device (in the startContainer function).

  4. CDI spec translation: if a CDI spec has been successfully generated and validated against the GPU device the user has requested, we need to translate this spec into actionable LXC mount entries plus cgroup permissions. The hooks need to be adapted to our format.

  5. LXC driver: detect if a GPU device has been started with CDI. If so, redirect the LXC hook from being hooks/nvidia to being cdi-hook.

  6. Creation of the cdi-hook binary in the LXD project. This binary will be responsible for executing all the hooks (pivoting to the container rootfs, then updating the ldcache and creating the symlinks). Its source code should live at the top level of the project structure (like other LXD tools). That way, we (Canonical) really own this hook and do not force the Linux Containers project to maintain it as part of their hooks folder. When CDI is used, we simply execute cdi-hook instead of LXC's hooks/nvidia hook.

  7. Adapt snap packaging to include the cdi-hook binary.

  8. Real-hardware testing: we can run a CUDA workload inside a container to check that the devices and the runtime libraries have been passed through.

API changes

No API changes expected.

CLI changes

No CLI changes expected.

Database changes

No database changes expected.


TODO:

  • PoC tool
  • snap package + integration test with dGPU + iGPU

Fixes #12525

@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch from 6b149e2 to 6a371b5 Compare June 6, 2024 16:47
@gabrielmougard gabrielmougard self-assigned this Jun 6, 2024
@github-actions github-actions bot added the Documentation Documentation needs updating label Jun 6, 2024

github-actions bot commented Jun 6, 2024

Heads up @mionaalex - the "Documentation" label was applied to this issue.

@tomponline (Member) commented:

@gabrielmougard as discussed in the 1:1, let's try to use the existing unix, disk and raw.lxc settings to confirm the theory that this allows iGPU cards to be used in the container, and if so we can move on to discussing what the user experience and implementation will be. Thanks

@elezar commented Jun 11, 2024

/cc @elezar

@zvonkok commented Jun 11, 2024

/cc @zvonkok

@gabrielmougard (Contributor, Author) commented:

@tomponline this current implementation works for dGPU / iGPU passthrough (documentation in progress) with Docker nested inside an LXD container (Docker inside requires security.nesting=true and security.privileged=true). We can use this cloud-init script cloud-init.yaml to test it:

#cloud-config
package_update: true
packages:
  - docker.io
write_files:
  - path: /etc/docker/daemon.json
    permissions: '0644'
    owner: root:root
    content: |
      {
        "max-concurrent-downloads": 12,
        "max-concurrent-uploads": 12, 
        "runtimes": {
          "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
          }
        }
      }
  - path: /root/run_tensorrt.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      echo "OS release,Kernel version"
      (. /etc/os-release; echo "${PRETTY_NAME}"; uname -r) | paste -s -d,
      echo
      nvidia-smi -q
      echo
      exec bash -o pipefail -c "
      cd /workspace/tensorrt/samples
      make -j4
      cd /workspace/tensorrt/bin
      ./sample_onnx_mnist
      retstatus=\${PIPESTATUS[0]}
      echo \"Test exited with status code: \${retstatus}\" >&2
      exit \${retstatus}
      "
runcmd:
  - systemctl start docker
  - systemctl enable docker
  - usermod -aG docker root
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure
  - systemctl restart docker

Then create the instance with the GPU

lxc init ubuntu:jammy t1 --config security.nesting=true --config security.privileged=true
lxc config set t1 cloud-init.user-data - < cloud-init.yml

# If the machine has a dGPU
lxc config device add t1 dgpu0 gpu gputype=physical id=nvidia.com/gpu=gpu0

# Or, if the machine has an iGPU
lxc config device add t1 igpu0 gpu gputype=physical id=nvidia.com/gpu=igpu0

lxc start t1

lxc shell t1
root@t1 # docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3 bash /sh_input/run_tensorrt.sh

If you passed an iGPU, once inside the container edit /etc/nvidia-container-runtime/config.toml to use mode=csv instead of mode=auto, then execute the docker command:

docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3-igpu bash /sh_input/run_tensorrt.sh

@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch 2 times, most recently from 3f436ad to f5e527c Compare June 21, 2024 09:39
@gabrielmougard gabrielmougard requested review from ru-fu and elezar June 21, 2024 09:39
@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch 2 times, most recently from 6ef0893 to 42d7b5b Compare June 21, 2024 11:28
@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch from 42d7b5b to d4c2b57 Compare June 21, 2024 14:53
@tomponline (Member) commented:

@gabrielmougard can you also check that first commit, as there is a persistent failure downloading the Go deps in the tests; it might need to be refreshed so that it reflects the current state of main

The new `github.com/NVIDIA/nvidia-container-toolkit` dependency is required because we need the `nvcdi` package
in order to generate an NVIDIA implementation of a CDI specification. All of the device / mount discovery logic
is encapsulated in this library.

Signed-off-by: Gabriel Mougard <[email protected]>
In order to log the internal discovery operations of the CDI library,
we created a `CDILogger` type that reuses LXD's existing shared logger,
but with slightly modified method prototypes to comply with the CDI logger interface.

Signed-off-by: Gabriel Mougard <[email protected]>
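As an illustration of that adapter, the shape of a CDILogger could be roughly as follows. The method set and the underlying logger are assumptions for the sketch, not taken from the actual commit or from the CDI library's real interface.

// Sketch: adapt a standard library logger to a hypothetical CDI-style logging
// interface by prefixing the level in each method.
package main

import "log"

// cdiLogger stands in for the interface the CDI library is assumed to expect.
type cdiLogger interface {
	Debugf(format string, args ...any)
	Infof(format string, args ...any)
	Warningf(format string, args ...any)
	Errorf(format string, args ...any)
}

// CDILogger wraps a *log.Logger, mirroring how a shared LXD logger could be reused.
type CDILogger struct{ l *log.Logger }

func (c *CDILogger) Debugf(f string, a ...any)   { c.l.Printf("DEBUG: "+f, a...) }
func (c *CDILogger) Infof(f string, a ...any)    { c.l.Printf("INFO: "+f, a...) }
func (c *CDILogger) Warningf(f string, a ...any) { c.l.Printf("WARN: "+f, a...) }
func (c *CDILogger) Errorf(f string, a ...any)   { c.l.Printf("ERROR: "+f, a...) }

func main() {
	var lg cdiLogger = &CDILogger{l: log.Default()}
	lg.Infof("generating CDI spec for %s", "nvidia.com/gpu=gpu0")
}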
…pport CDI naming for `physical` gputype

Signed-off-by: Gabriel Mougard <[email protected]>
…ntns` callhook if CDI devices have been configured

Signed-off-by: Gabriel Mougard <[email protected]>
…mount` for executing the CDI hooks

Signed-off-by: Gabriel Mougard <[email protected]>
In a select statement, only one case is executed when a channel operation succeeds, so once a case has been selected and its associated code block has run, the select statement terminates automatically (there is no fall-through behaviour as in a switch statement). When the response case is selected (i.e. a value is received from the channel), the code inside that case runs and the select then completes naturally. The break statement here doesn't break out of any enclosing loop or switch; it only breaks out of the select, which would happen anyway.

Signed-off-by: Gabriel Mougard <[email protected]>
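A tiny self-contained Go example of the point made in that commit message (illustrative only):

// The break in the single select case below is redundant: the select ends as
// soon as the chosen case's body finishes.
package main

import "fmt"

func main() {
	response := make(chan string, 1)
	response <- "done"

	select {
	case msg := <-response:
		fmt.Println("received:", msg)
		break // redundant: control leaves the select here regardless
	}

	fmt.Println("after select") // reached either way
}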
Signed-off-by: Gabriel Mougard <[email protected]>
@gabrielmougard gabrielmougard force-pushed the feat/igpu-container-passthrough branch from 5ac51bd to 2421c34 Compare August 28, 2024 08:51
@gabrielmougard (Contributor, Author) commented Aug 28, 2024

@tomponline updated. Although there seems to be a new issue with the clustering tests (that I also saw this morning with #13995)

@tomponline (Member) commented:

> @tomponline updated. Although there seems to be a new issue with the clustering tests (that I also saw this morning with #13995)

That seems related to the dqlite PPA; I have flagged it to the dqlite team, but it is not related to your PR. Thanks for flagging it though.

@tomponline (Member) left a comment:

Thanks!

@tomponline tomponline merged commit c2862fa into canonical:main Aug 28, 2024
27 of 28 checks passed
Comment on lines +29 to +38
if d.Major == 0 || d.Minor == 0 {
stat := unix.Stat_t{}
err := unix.Stat(hostPath, &stat)
if err != nil {
return err
}

d.Major = int64(unix.Major(uint64(stat.Rdev)))
d.Minor = int64(unix.Minor(uint64(stat.Rdev)))
}

As a follow-up question: Would it make sense for us to expose the fillMissingInfo logic that is used in CDI through a public API: https://github.com/cncf-tags/container-device-interface/blob/main/pkg/cdi/container-edits_unix.go#L60

@gabrielmougard (Contributor, Author) replied:

That could be a nice add yes :)

A Member replied:

Please can an issue be logged in GH to track this, @gabrielmougard? Thanks

Comment on lines +59 to +61
if mount.HostPath == "" || mount.ContainerPath == "" {
return nil, fmt.Errorf("The hostPath or containerPath is empty in the CDI mount: %v", *mount)
}

Note that strictly speaking it is possible to define a tmpfs which may have an empty HostPath and the type field set to "tmpfs". See https://github.com/opencontainers/runc/blob/346b818dad833a5ae8ab55d670b716dadd45950e/libcontainer/rootfs_linux.go#L521-L533

The source / host path is only required for a bind mount.
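A hedged sketch of that relaxation (not the merged LXD code; the cdiMount struct below only models the CDI fields used here) could look like this:

// Sketch: only require a host path for bind mounts, so a "tmpfs" CDI mount
// with an empty hostPath is accepted.
package main

import "fmt"

type cdiMount struct {
	HostPath      string
	ContainerPath string
	Type          string
}

func validateCDIMount(mount *cdiMount) error {
	if mount.ContainerPath == "" {
		return fmt.Errorf("The containerPath is empty in the CDI mount: %v", *mount)
	}

	// Only bind mounts need a source on the host.
	if mount.Type != "tmpfs" && mount.HostPath == "" {
		return fmt.Errorf("The hostPath is empty in the CDI mount: %v", *mount)
	}

	return nil
}

func main() {
	fmt.Println(validateCDIMount(&cdiMount{ContainerPath: "/tmp/scratch", Type: "tmpfs"})) // <nil>
	fmt.Println(validateCDIMount(&cdiMount{ContainerPath: "/usr/lib/libcuda.so.1"}))       // error: empty hostPath
}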

tomponline added a commit to canonical/lxd-ci that referenced this pull request Sep 3, 2024
hamistao pushed a commit to hamistao/lxd-ci that referenced this pull request Sep 6, 2024
tomponline added a commit that referenced this pull request Nov 21, 2024
Add an API extension for adding the CDI mode for GPU passthrough (see
#13562)