gpu: Support GPU passthrough to LXD containers using Container Device Interface (CDI) #13562
Conversation
Force-pushed from 6b149e2 to 6a371b5
Heads up @mionaalex - the "Documentation" label was applied to this issue.
@gabrielmougard as discussed in the 1:1 let's try and use the existing
/cc @elezar
/cc @zvonkok
Force-pushed from fc964b9 to 6357d56
Force-pushed from 6357d56 to 5eb4773
Force-pushed from 381a348 to d9e031c
Force-pushed from d9e031c to 498591c
@tomponline this current implementation works for the dGPU / iGPU passthrough (working on the documentation) with docker nested inside a LXD container (docker inside requires the cloud-init configuration below):

```yaml
#cloud-config
package_update: true
packages:
  - docker.io
write_files:
  - path: /etc/docker/daemon.json
    permissions: '0644'
    owner: root:root
    content: |
      {
        "max-concurrent-downloads": 12,
        "max-concurrent-uploads": 12,
        "runtimes": {
          "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
          }
        }
      }
  - path: /root/run_tensorrt.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      echo "OS release,Kernel version"
      (. /etc/os-release; echo "${PRETTY_NAME}"; uname -r) | paste -s -d,
      echo
      nvidia-smi -q
      echo
      exec bash -o pipefail -c "
      cd /workspace/tensorrt/samples
      make -j4
      cd /workspace/tensorrt/bin
      ./sample_onnx_mnist
      retstatus=\${PIPESTATUS[0]}
      echo \"Test exited with status code: \${retstatus}\" >&2
      exit \${retstatus}
      "
runcmd:
  - systemctl start docker
  - systemctl enable docker
  - usermod -aG docker root
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update
  - DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure
  - systemctl restart docker
```

Then create the instance with the GPU:

```bash
lxc init ubuntu:jammy t1 --config security.nesting=true --config security.privileged=true
lxc config set t1 cloud-init.user-data - < cloud-init.yml
# If the machine has a dGPU
lxc config device add t1 dgpu0 gpu gputype=physical id=nvidia.com/gpu=gpu0
# Or, if the machine has an iGPU
lxc config device add t1 igpu0 gpu gputype=physical id=nvidia.com/gpu=igpu0
lxc start t1
lxc shell t1
root@t1 # docker run --gpus all --rm -v $(pwd):/sh_input nvcr.io/nvidia/tensorrt:24.02-py3 bash /sh_input/run_tensorrt.sh
```

If you passed an iGPU, when you enter the container, go to the
Force-pushed from 3f436ad to f5e527c
Force-pushed from 6ef0893 to 42d7b5b
Force-pushed from 42d7b5b to d4c2b57
@gabrielmougard can you also check that first commit as there is a persistent failure downloading the go deps on the tests, might need to refresh that commit so it reflects the current state in main.
The new `github.com/NVIDIA/nvidia-container-toolkit` dependency is required because we need the `nvcdi` package to generate an NVIDIA implementation of a CDI specification. All the device / mount discovery logic is encapsulated in this library. Signed-off-by: Gabriel Mougard <[email protected]>
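For context, a minimal sketch of how the `nvcdi` package can be driven to produce such a spec is shown below. It assumes the `New`/`GetSpec`/`Save` entry points available in recent nvidia-container-toolkit releases and an arbitrary output path; it is not the code added by this commit.

```go
package main

import (
	"log"

	"github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"
)

func main() {
	// Build an nvcdi library handle with default options (assumed API shape
	// for recent toolkit releases; older releases differ slightly).
	lib, err := nvcdi.New()
	if err != nil {
		log.Fatalf("failed to create nvcdi library: %v", err)
	}

	// Discover the NVIDIA devices, mounts and hooks on this host and build
	// a CDI specification describing them.
	spec, err := lib.GetSpec()
	if err != nil {
		log.Fatalf("failed to generate CDI spec: %v", err)
	}

	// Persist the spec to a hypothetical location for inspection.
	if err := spec.Save("/tmp/nvidia-cdi.yaml"); err != nil {
		log.Fatalf("failed to save CDI spec: %v", err)
	}
}
```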
Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
In order to log the internal discovery operations of the CDI library, we created a `CDILogger` type that reuses LXD's existing shared logger, but with slightly modified method prototypes to comply with the CDI logger interface. Signed-off-by: Gabriel Mougard <[email protected]>
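As a rough illustration of that adapter pattern (not the code from this commit), here is a minimal sketch. The `cdiLogging` interface is a hypothetical stand-in for the toolkit's logger interface, and the standard library logger stands in for LXD's shared logger.

```go
package main

import (
	"fmt"
	"log"
)

// cdiLogging is a hypothetical stand-in for the logging interface expected
// by the CDI library; the real method set may differ.
type cdiLogging interface {
	Debugf(format string, args ...any)
	Infof(format string, args ...any)
	Warningf(format string, args ...any)
	Errorf(format string, args ...any)
}

// CDILogger adapts an underlying logger (here the standard library logger,
// standing in for LXD's shared logger) to the interface above.
type CDILogger struct {
	l *log.Logger
}

func (c *CDILogger) Debugf(format string, args ...any)   { c.l.Print("DEBUG: " + fmt.Sprintf(format, args...)) }
func (c *CDILogger) Infof(format string, args ...any)    { c.l.Print("INFO: " + fmt.Sprintf(format, args...)) }
func (c *CDILogger) Warningf(format string, args ...any) { c.l.Print("WARNING: " + fmt.Sprintf(format, args...)) }
func (c *CDILogger) Errorf(format string, args ...any)   { c.l.Print("ERROR: " + fmt.Sprintf(format, args...)) }

func main() {
	var logger cdiLogging = &CDILogger{l: log.Default()}
	logger.Infof("discovered %d GPU device nodes", 2)
}
```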
Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
…pport CDI naming for `physical` gputype Signed-off-by: Gabriel Mougard <[email protected]>
…mounts) logic Signed-off-by: Gabriel Mougard <[email protected]>
…ringToUint32` Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
…ntns` callhook if CDI devices have been configured Signed-off-by: Gabriel Mougard <[email protected]>
…mount` for executing the CDI hooks Signed-off-by: Gabriel Mougard <[email protected]>
In a select statement, only one case is executed when a channel operation succeeds so once a case is selected and its associated code block is executed, the select statement automatically terminates (there's no fall-through behavior like in a switch statement). So when the response case is selected (i.e., a value is received from the channel), the code inside that case is executed, and then the select statement naturally completes. The break statement here doesn't break out of any additional loop or switch statement; it's just breaking out of the select, which would happen anyway. Signed-off-by: Gabriel Mougard <[email protected]>
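To make the point concrete, here is a small self-contained example of the behaviour described (channel and message names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	response := make(chan string, 1)
	response <- "done"

	select {
	case msg := <-response:
		// Once this case runs, the select completes on its own; an explicit
		// `break` here would only exit the select, which happens anyway
		// (there is no fall-through between cases, unlike some switch
		// semantics in other languages).
		fmt.Println("received:", msg)
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for a response")
	}
}
```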
Signed-off-by: Gabriel Mougard <[email protected]>
Signed-off-by: Gabriel Mougard <[email protected]>
Force-pushed from 5ac51bd to 2421c34
@tomponline updated. Although there seems to be a new issue with the clustering tests (that I also saw this morning with #13995).
That seems related to the dqlite ppa, have flagged to dqlite team but not related to your PR. Thanks for flagging though.
Thanks!
```go
if d.Major == 0 || d.Minor == 0 {
	stat := unix.Stat_t{}
	err := unix.Stat(hostPath, &stat)
	if err != nil {
		return err
	}

	d.Major = int64(unix.Major(uint64(stat.Rdev)))
	d.Minor = int64(unix.Minor(uint64(stat.Rdev)))
}
```
As a follow-up question: would it make sense for us to expose the `fillMissingInfo` logic that is used in CDI through a public API? https://github.com/cncf-tags/container-device-interface/blob/main/pkg/cdi/container-edits_unix.go#L60
That could be a nice add yes :)
Please can an issue be logged to track this @gabrielmougard in GH thanks
```go
if mount.HostPath == "" || mount.ContainerPath == "" {
	return nil, fmt.Errorf("The hostPath or containerPath is empty in the CDI mount: %v", *mount)
}
```
Note that strictly speaking it is possible to define a `tmpfs` mount which may have an empty `HostPath` and the `type` field set to `"tmpfs"`. See https://github.com/opencontainers/runc/blob/346b818dad833a5ae8ab55d670b716dadd45950e/libcontainer/rootfs_linux.go#L521-L533. The source / host path is only required for a `bind` mount.
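For illustration, a sketch of how the check above could be relaxed along those lines, assuming the CDI mount struct from the specs-go package with its `Type` field; this is not the change made in the PR.

```go
package main

import (
	"fmt"

	specs "tags.cncf.io/container-device-interface/specs-go"
)

// validateCDIMount sketches a tmpfs-aware variant of the check above: a
// container path is always required, but an empty host path is tolerated
// when the mount type is "tmpfs", since only bind mounts need a source.
func validateCDIMount(mount *specs.Mount) error {
	if mount.ContainerPath == "" {
		return fmt.Errorf("The containerPath is empty in the CDI mount: %v", *mount)
	}

	if mount.HostPath == "" && mount.Type != "tmpfs" {
		return fmt.Errorf("The hostPath is empty in the non-tmpfs CDI mount: %v", *mount)
	}

	return nil
}

func main() {
	m := &specs.Mount{ContainerPath: "/tmp/scratch", Type: "tmpfs"}
	fmt.Println(validateCDIMount(m)) // prints <nil>: an empty HostPath is fine for tmpfs
}
```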
Abstract
Don't rely exclusively on `nvidia-container-cli` in LXC to configure a GPU passthrough. As that tool is being deprecated and won't work with the NVIDIA Tegra iGPU, we use the `nvidia-container-toolkit` (see here) related tools to generate a source of truth instructing LXD on what to pass (card devices, non-card devices and runtime libraries) to the container namespace.
Rationale
The reason behind this change is the need to support NVIDIA iGPU passthrough for LXD containers (e.g., Tegra SoCs such as the AGX/IGX Orin boards). The current implementation of the NVIDIA GPU passthrough for LXD containers is based on the `nvidia-container-cli` tool, which is not compatible with iGPU passthrough: `nvidia-container-cli` and other tools like `nvidia-smi` rely on NVML (the NVIDIA Management Library), which reads PCI / PCI-X information to work properly. That information cannot be obtained for an iGPU living on an AMBA bus rather than on PCI / PCI-X, so NVML is not usable there.
Specification
A novel approach could consist in leveraging the recent effort from NVIDIA to provide a more generic and flexible tool called `nvidia-container-toolkit`. This project focuses on:
- Generating, validating and managing a platform configuration file to be integrated with an OCI runtime. This file describes which device nodes need to be mounted inside the container to provide the necessary access to the GPU, alongside the symlinks to the NVIDIA drivers and libraries. The file (JSON format) is supposed to be 'merged' with the OCI image definition (also in JSON format) to provide a complete description of the container runtime environment. The standard uses for this CDI definition can be found here.
- Providing a runtime shim to allow GPU device passthrough in OCI-based container manager solutions.
One might wonder how this would be useful, as LXD does not follow the OCI specification. The idea is to re-use the generation, validation and management logic of a platform-specific configuration file (which follows the CDI specification) but to adapt the mounting logic so that it can be used in a LXC context. Instead of merging this CDI representation with the OCI image definition (which we don't have anyway), we'll read the device node entries and associate them with `lxc.mount.entry` elements, as sketched below. We might also have to add the associated `lxc.cgroup*.devices.allow` entries to allow the container to access the GPU device nodes.
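A minimal sketch of that translation, assuming a CDI device node described only by its path, device type and major/minor numbers; the function name and the exact mount options are illustrative, not the implementation in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// lxcEntriesForDeviceNode sketches how a single CDI device node could be
// turned into raw LXC configuration lines: one bind mount entry and one
// cgroup2 device allowance. The option string is an assumption made for
// illustration purposes.
func lxcEntriesForDeviceNode(path string, devType string, major, minor int64) []string {
	// LXC mount targets are expressed relative to the container rootfs.
	target := strings.TrimPrefix(path, "/")
	return []string{
		fmt.Sprintf("lxc.mount.entry = %s %s none bind,create=file 0 0", path, target),
		fmt.Sprintf("lxc.cgroup2.devices.allow = %s %d:%d rwm", devType, major, minor),
	}
}

func main() {
	// /dev/nvidia0 is a character device with major 195, minor 0.
	for _, line := range lxcEntriesForDeviceNode("/dev/nvidia0", "c", 195, 0) {
		fmt.Println(line)
	}
}
```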
Implementation details
In the `id` field (`gputype=physical` and `gputype=mig`), possibility to describe by CDI identifier
The `id` GPU config option is meant to receive the DRM card ID of the parent GPU device. We'll augment this option to also accept the CDI identifier of the GPU device. In the same fashion as OCI device passing, it means that `id` could be, for example, `{VENDOR_DOMAIN_NAME}/gpu=gpu{INDEX}` for each (non-MIG-enabled, if the vendor is NVIDIA) full GPU in the system. For MIG-enabled GPUs, the `id` would be `{VENDOR_DOMAIN_NAME}/gpu=mig{GPU_INDEX}:{MIG_INDEX}`. We'll also have `{VENDOR_DOMAIN_NAME}/gpu=all`, which will potentially create multiple LXD GPU devices in the container, one for each GPU in the system.
Having a CDI identifier like this gives us a way to target an iGPU that does not live on a PCIe bus (no `pci` config option is possible).
For the iGPU, we'll not introduce a new `gputype` but we'll use `physical`, except that the `id` will have to be a CDI identifier (because there is no PCI address to map to the device).
This approach is ideal for the end user because we did not introduce a new `gputype` for the iGPU nor change the config options for the other gputypes. The `id` option is simply more flexible and can accept a DRM card ID or a CDI identifier (in this case, it overrides `vendorid`, `productid`, `pci` and even `mig.*` if the card is MIG-enabled). The rest of the machinery is hidden from the user.
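As an illustration of the extra validation the roadmap below calls for, here is a minimal sketch that distinguishes a plain DRM card ID from a CDI identifier; the function name and the exact acceptance rules are assumptions, not the validation shipped in this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// cdiIDPattern loosely matches identifiers such as "nvidia.com/gpu=gpu0",
// "nvidia.com/gpu=mig0:1" or "nvidia.com/gpu=all". The real validation in
// LXD may be stricter; this pattern is for illustration only.
var cdiIDPattern = regexp.MustCompile(`^[a-z0-9.-]+/gpu=(all|gpu\d+|igpu\d+|mig\d+:\d+)$`)

// classifyGPUID reports whether an "id" value looks like a CDI identifier
// or a plain DRM card ID.
func classifyGPUID(id string) string {
	if cdiIDPattern.MatchString(id) {
		return "cdi"
	}
	return "drm"
}

func main() {
	for _, id := range []string{"0", "nvidia.com/gpu=gpu0", "nvidia.com/gpu=mig0:1", "nvidia.com/gpu=all"} {
		fmt.Printf("%s -> %s\n", id, classifyGPUID(id))
	}
}
```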
With this change, we built up this development roadmap:
- Augmenting `gputype=physical`: the `id` config option needs to be augmented to support a CDI identifier. We need more validation rules. Handle the `all` case.
- Augmenting `gputype=mig`: the `id` config option needs to be augmented to support a CDI identifier. Same as above. Handle the `all` case.
- CDI spec generation: we need to generate the CDI spec each time we start the GPU device (in the `startContainer` function).
- CDI spec translation: if a CDI spec has been successfully generated and validated against the GPU device the user has queried, we need to translate this spec into actionable LXC mount entries + cgroup permissions. The hooks need to be adapted to our format.
- LXC driver: detect if a GPU device has been started with CDI. If so, redirect the LXC hook from being `hooks/nvidia` to being `cdi-hook`.
- Creation of the `cdi-hook` binary in the LXD project. This binary will be responsible for executing all the hooks (pivoting to the container rootfs, then updating the ldcache, and then creating the symlinks); a loose sketch of these two hook actions follows this list. This binary's source code should live at the top level of the project structure (like other LXD tools). That way, we (Canonical) really own this hook and do not force the Linux Containers project to maintain it as part of their `hooks` folder. In case of CDI usage, we just execute `cdi-hook` instead of the `hooks/nvidia` hook in LXC for the Linux Containers project.
- Adapt snap packaging to include the `cdi-hook` binary.
- Real-hardware testing: we can run a CUDA workload inside a container to check that the devices and the runtime libraries have been passed through.
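A loose sketch of the two hook actions mentioned in the `cdi-hook` item above (refreshing the ld cache and creating driver symlinks), under the assumption that we act on the container rootfs path via ldconfig's `-r` option instead of pivoting into it; the paths and function names are hypothetical and this is not the PR's implementation.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// updateLDCache runs ldconfig against the container rootfs so that newly
// mounted NVIDIA libraries are picked up by the dynamic linker.
func updateLDCache(rootfs string) error {
	cmd := exec.Command("ldconfig", "-r", rootfs)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// createSymlink creates a symlink inside the rootfs pointing at linkTarget,
// e.g. libcuda.so -> libcuda.so.1.
func createSymlink(rootfs, linkTarget, linkName string) error {
	full := filepath.Join(rootfs, linkName)
	if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
		return err
	}
	return os.Symlink(linkTarget, full)
}

func main() {
	rootfs := "/var/lib/lxd/containers/t1/rootfs" // hypothetical rootfs path
	if err := updateLDCache(rootfs); err != nil {
		fmt.Fprintln(os.Stderr, "update-ldcache:", err)
	}
	if err := createSymlink(rootfs, "libcuda.so.1", "/usr/lib/x86_64-linux-gnu/libcuda.so"); err != nil {
		fmt.Fprintln(os.Stderr, "create-symlinks:", err)
	}
}
```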
API changes
No API changes expected.
CLI changes
No CLI changes expected.
Database changes
No database changes expected.
TODO:
Fixes #12525