Fedora CoreOS official support in all components #696

Open
dfateyev opened this issue Apr 12, 2024 · 4 comments

dfateyev commented Apr 12, 2024

As stated in the official documentation, there is currently no support for recent Fedora CoreOS-based workers in Kubernetes: no official GPU driver images are published for Fedora CoreOS, and there are no official recommendations on how to deploy the GPU Operator to Kubernetes with Fedora CoreOS hosts.

We currently run Kubernetes solutions on OpenStack (featuring Fedora CoreOS and containerd). In order to use the GPU Operator functionality, we have to resort to various hacks and workarounds, along with a custom GPU driver image: running the GPU driver container and the Container Toolkit on the nodes separately, outside of Kubernetes scope, and then deploying the GPU Operator in Kubernetes with the already-present components disabled (a sketch of that install is shown below). This deployment approach is pretty cumbersome.
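For reference, a minimal sketch of the "components disabled" install mentioned above, assuming the driver and toolkit have already been provisioned on the host out of band (the exact flag set depends on the operator version):

```sh
# Sketch only: deploy the GPU Operator while skipping the components that are
# already running on the host, outside of Kubernetes.
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```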

We are interested in official Fedora CoreOS support in both the operator and the GPU driver.
In the ideal scenario, we would like to install the GPU Operator and have all the components work in Kubernetes with containerd out of the box. We understand that we might need a custom GPU driver image, but without even initial CoreOS-native support it is hard to prepare one.

There have been several requests for better support of Fedora CoreOS driver images, e.g. #34 and #8, and we would like to extend this request to better support in all GPU Operator components.
We understand that "support in all components out of the box" is a pretty broad subject, but we could at least start with something and gradually improve and test the functionality.

1. Quick Debug Information

  • OS/Version: Fedora CoreOS 39
  • Kernel Version: 6.5.11-300.fc39.x86_64
  • Container Runtime Type/Version: containerd
  • K8s Flavor/Version: Kubernetes (Magnum on OpenStack)
  • GPU Operator Version: gpu-operator-v23.9.2

2. Issue or feature description

We have prepared a custom (unofficial) GPU driver image to use the operator functionality: the Fedora image from the repo doesn't work out of the box, but it can be started with workarounds. Even so, nvidia-operator-validator cannot finish the deployment validation.

3. Steps to reproduce the issue

  • Install the operator into the OpenStack Magnum cluster with:

```sh
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.usePrecompiled=true \
  --set driver.version="550.54.15" \
  --set driver.repository="docker.io/dfateyev"
```

    where "dfateyev/driver" is a custom GPU driver image for this Kubernetes cluster;
  • nvidia-smi cannot access the GPU device files: we need to create them explicitly beforehand (see the sketch after this list);
  • The nvidia-operator-validator fails to start properly (see the attached logs below).
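For context, a minimal sketch of the device-file workaround mentioned above, along the lines of the device-node creation snippet from NVIDIA's driver installation docs (the GPU count heuristic and the 666 permissions are assumptions; adjust for the actual node):

```sh
# Sketch only: create the NVIDIA device nodes that nvidia-smi expects, since
# nothing creates them for us on a Fedora CoreOS host.
/sbin/modprobe nvidia || exit 1

# Each GPU gets a character device with major number 195, minors 0..N-1.
N=$(lspci | grep -ci nvidia)   # rough GPU count; may include audio functions, adjust as needed
for i in $(seq 0 $((N - 1))); do
  mknod -m 666 "/dev/nvidia${i}" c 195 "${i}"
done
mknod -m 666 /dev/nvidiactl c 195 255

# The UVM device uses a dynamically assigned major number.
/sbin/modprobe nvidia-uvm || exit 1
D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
mknod -m 666 /dev/nvidia-uvm c "${D}" 0
```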

4. Information to attach

Attached logs: issue-696-logs.zip

@fifofonix

Out-of-the-box support for FCOS would be great. Even after investing some time in deploying via Helm with various components disabled and/or flagged as host-installed, I have been unable to get this working.

As a result, I've had to fall back to a model where I run nvidia-device-plugin with the Container Toolkit installed on the host and the driver container running outside of k8s (a sketch of that setup follows).
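For reference, a minimal sketch of running the driver container outside of k8s, in the spirit of NVIDIA's standalone driver-container docs; the image reference reuses the custom image from this thread, and its tag is an assumption (there is no official Fedora CoreOS tag):

```sh
# Sketch only: run the NVIDIA driver container directly on the host so it
# loads the kernel modules and exposes the driver rootfs under
# /run/nvidia/driver for the other components to consume.
sudo podman run -d --privileged --pid=host \
  -v /run/nvidia:/run/nvidia:shared \
  -v /var/log:/var/log \
  --name nvidia-driver \
  docker.io/dfateyev/driver:550.54.15-fedora39   # hypothetical tag
```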


r0k5t4r commented Aug 8, 2024

Hi all,

I also tried to get the GPU Operator working using some other Docker images, and I think I ended up with the same problem.

Looking at the logs from dfateyev, I see the same error:

```
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: ldcache error: open failed: /run/nvidia/driver/sbin/ldconfig.real: no such file or directory:
```

I tried to symlink ldconfig to ldconfig.real as described in the docs, but since the OS is immutable it doesn't work.

Maybe I'm barking up the wrong tree here, but maybe not. :)
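One hedged idea (an assumption, not something from the toolkit docs): the failing path /run/nvidia/driver/sbin/ldconfig.real lives inside the driver container's rootfs rather than on the immutable host, so the symlink could be created there instead:

```sh
# Sketch only: ldconfig.real is a Debian/Ubuntu naming convention; Fedora
# ships plain ldconfig. /run/nvidia/driver is the driver container's rootfs
# and is writable, unlike the immutable host filesystem, so create the
# expected name there (relative link, resolved inside that rootfs).
ln -s ldconfig /run/nvidia/driver/sbin/ldconfig.real
```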


r0k5t4r commented Aug 8, 2024

Here is a similar problem where symlinking solved it:

NVIDIA/nvidia-container-toolkit#147


r0k5t4r commented Aug 9, 2024

Hi,

I managed to get the GPU Operator working (the driver is not working yet). The failure was related to the ldconfig.real binary. To fix it, you simply need to run a different image for the toolkit pod:

```sh
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --version=v22.9.0 \
  --set toolkit.version=v1.11.0-ubi8
```

I can successfully build a driver container, but it fails to install the driver due to missing kernel headers: the ones needed in my case just don't exist in the repositories. Weird. But I guess you somehow managed to build a container with the drivers, so it must be doable. Once that works, the GPU Operator should be fully operational for Fedora CoreOS.
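A hedged guess at the missing-headers problem: the Fedora mirrors only keep kernel-devel for the latest kernel of each release, while an FCOS node may be pinned to an older one; the matching packages usually remain downloadable from Koji. A sketch, assuming the kernel version from the report above:

```sh
# Sketch only: inside the driver build container, fetch kernel-devel matching
# the node's kernel. The version/release split follows the
# 6.5.11-300.fc39.x86_64 kernel reported in this issue.
KVER=6.5.11 REL=300.fc39 ARCH=x86_64
dnf install -y "kernel-devel-${KVER}-${REL}.${ARCH}" || \
  dnf install -y \
    "https://kojipkgs.fedoraproject.org/packages/kernel/${KVER}/${REL}/${ARCH}/kernel-devel-${KVER}-${REL}.${ARCH}.rpm"
```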
