This patch deprecates the NVIDIA toolkit extension and introduces a new `nvidia-driver` extension (in production/LTS versions and open-source/proprietary flavors). The NVIDIA Container Toolkit must be installed independently: via a future Talos extension, the NVIDIA GPU Operator, or by the cluster administrator.
The extension depends on the new glibc extension (#473) and participates in its filesystem subroot by installing all the NVIDIA components in it.
Finally, the extension runs a service that bind mounts this glibc subroot at `/run/nvidia/driver` and runs the `nvidia-persistenced` daemon. This careful setup allows the NVIDIA GPU Operator to use the extension as if it were a traditional NVIDIA driver container.
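For context on how a cluster would pick this up, an Image Factory schematic along these lines should be enough to build an installer image with the new extension. The extension names below are placeholders for the glibc extension (#473) and one flavor of the `nvidia-driver` extension; the published names may differ.

```yaml
# Illustrative Image Factory schematic; extension names are placeholders,
# not the final names defined by this PR and #473.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/glibc
      - siderolabs/nvidia-driver-production
```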
--
I've tested this extension on my homelab cluster with the current release of the NVIDIA GPU Operator, letting the operator install and configure the NVIDIA Container Toolkit (with my Go wrapper patch, NVIDIA/nvidia-container-toolkit#700).
This is a more Talos-native way of managing NVIDIA drivers, as opposed to letting the GPU Operator load and unload drivers based on its `ClusterPolicy` or `NVIDIADriver` custom resources, as discussed in siderolabs/talos#9339 and #473.

This configuration only works in CDI mode, as the "legacy" runtime hook requires more libraries than this PR leaves in place.
--
One other cluster requirement is configuring the containerd runtime classes. The GPU Operator and the container toolkit installer (which ships with the toolkit and is what the operator uses to install it) both have logic to install the runtime classes and patch the containerd config, but that does not work on Talos because the containerd config is synthesized from files that reside on the read-only system partition.
The operator can be installed with its containerd configuration management bypassed/disabled; the cluster administrator is then on the hook to set up the containerd runtime and runtime class themselves.
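As a sketch of what that leaves for the administrator, a machine config patch can drop a containerd config fragment in place via Talos's `/etc/cri/conf.d/` customization mechanism. The binary path below assumes the toolkit's default install location, and the plugin section name depends on the containerd version, so treat both as assumptions.

```yaml
# Talos machine config patch (sketch): register an "nvidia" containerd runtime.
machine:
  files:
    - path: /etc/cri/conf.d/20-customization.part
      op: create
      content: |
        # Section name is for containerd 1.x ("io.containerd.grpc.v1.cri");
        # containerd 2.x uses "io.containerd.cri.v1.runtime" instead.
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            # Adjust to wherever the toolkit installer puts the runtime binary.
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```

On the Kubernetes side, a `node.k8s.io/v1` `RuntimeClass` named `nvidia` with `handler: nvidia` also has to exist, if nothing else creates it.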
--
There could be a Talos extension for the NVIDIA Container Toolkit. It would probably look a lot like the existing one and might even include all the userspace libraries needed for the legacy runtime (basically for `nvidia-container-cli`). For CDI mode support, a service could invoke `nvidia-ctk` to generate the CDI spec for the devices present on each node (this is a Go binary that only requires glibc and the driver libraries). However, there is a fair amount of logic in the GPU Operator for configuring the toolkit to work with the other components the operator may install and manage on the cluster, so a Talos extension for the toolkit would provide a less integrated and possibly less functional experience.
to generate the CDI spec for the devices present on each node (this is a Go binary that only requires glibc and the driver libraries). However, there is some amount of logic in the GPU Operator to configure the toolkit to work with all the other components that the operator may install and manage on the cluster, so a Talos extension for the toolkit would provide a less integrated, possibly less functional experience.