NVIDIA GPU Operator Constantly Resynchronizing in ArgoCD #76

Open
carlmes opened this issue Nov 18, 2024 · 1 comment

Comments

@carlmes
Member

carlmes commented Nov 18, 2024

When building a brand new cluster using these settings:

  • AWS with OpenShift Environment
  • OpenShift version 4.17
  • Control Plane Instance Type: m6a.4xlarge
  • After cluster is provisioned, run bootstrap.sh and choose: 6) rhoai-stable-2.13-aws-gpu

The GPU Operator application in ArgoCD goes out of sync every couple of seconds:

image

Looking at the sync error, we see that Argo is trying to add these lines:

image

But the GPU Operator seems to be removing them.
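One way to pin down exactly which fields are being stripped is to compare the manifest Argo wants to apply against the live object on the cluster. The sketch below is only illustrative: `missing_paths` is a hypothetical helper, and the field names in the sample dictionaries are placeholders, not the actual lines from the screenshot above. It assumes both manifests have already been loaded into plain Python dictionaries.

```python
from typing import Any


def missing_paths(desired: Any, live: Any, prefix: str = "") -> list[str]:
    """Return dotted paths that exist in `desired` but are absent from `live`."""
    if not isinstance(desired, dict):
        # Scalars and lists are not compared here; only missing keys are reported.
        return []
    if not isinstance(live, dict):
        # The live side has no mapping at this level, so every desired key is missing.
        live = {}
    paths: list[str] = []
    for key, value in desired.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if key not in live:
            paths.append(path)
        else:
            paths.extend(missing_paths(value, live[key], path))
    return paths


# Placeholder manifests (assumption): not the real fields from this issue.
desired = {"spec": {"exampleSection": {"enabled": True, "exampleField": "value"}}}
live = {"spec": {"exampleSection": {"enabled": True}}}

print(missing_paths(desired, live))
```

Any path it reports is a candidate for either removing from the kustomize templates or excluding from Argo's diff, depending on which side should be the source of truth.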

Also, the monitoring console in OpenShift appears to be broken in this release:

image

It's probably just a recent change in the NVIDIA GPU Operator that needs to be incorporated into the kustomize templates in this project. The source code for the lines that will not synchronize is at:

@carlmes
Member Author

carlmes commented Nov 18, 2024

Noting that this failed twice in a row on two brand-new 4.17 clusters, but did not appear to be an issue on earlier versions such as 4.15 and 4.16 during enablement testing.
