
nvidia-k8s-device-plugin: wait for kubelet.sock before starting #228

Merged (1 commit, Nov 1, 2024)

Conversation

@cbgbt (Contributor) commented on Oct 30, 2024

Issue number:

Closes bottlerocket-os/bottlerocket#4250

Description of changes:
It is possible for nvidia-k8s-device-plugin and kubelet to race, causing graphics nodes to fail to expose GPUs via kubelet.

Specifically, nvidia-k8s-device-plugin starts after kubelet; however, it depends on the kubelet device-plugin management socket being available in order to register itself.

The kubelet service does not synchronize creation of the device-plugin management socket with its systemd "notify" readiness signal, which means that kubelet may report as started before the socket is ready.

If the socket is created after nvidia-k8s-device-plugin begins watching it for inotify events, the creation event may trigger the device plugin's restart logic (the device plugin assumes that kubelet has restarted in this case).
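Purely for illustration (not from the PR itself): the kind of filesystem event the plugin reacts to can be observed from a shell with `inotifywait` from inotify-tools, assuming the default kubelet device-plugin directory:

```sh
# Watch the kubelet device-plugins directory. If kubelet.sock is created after
# this watch is established, a CREATE event fires -- the same class of event
# that sends the device plugin down its restart path.
inotifywait -m -e create /var/lib/kubelet/device-plugins/
```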

Unfortunately, device plugin restarts seem to be somewhat flaky due to issues discussed in bottlerocket-os/bottlerocket#4250.

This change makes the nvidia-k8s-device-plugin unit require kubelet.sock to exist as a socket before starting. If the socket is absent, the unit fails to start and then retries every 2 seconds until the socket is available. An initial sleep is performed first, because kubelet.sock usually does not yet exist at the moment systemd first tries to start nvidia-k8s-device-plugin.
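The merged commit is not reproduced in this conversation, but a minimal sketch of a systemd drop-in with this behavior could look like the following (the socket path assumes the default kubelet device-plugin directory, and the binary paths, sleep, and retry values are illustrative, mirroring the description above):

```ini
# Illustrative drop-in for nvidia-k8s-device-plugin.service; not the exact merged change.
[Unit]
# Retry indefinitely instead of tripping systemd's default start-rate limit.
StartLimitIntervalSec=0

[Service]
# kubelet.sock usually does not exist yet when systemd first starts this unit.
ExecStartPre=/usr/bin/sleep 2
# Fail the unit unless kubelet.sock exists and is a socket (-S).
ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock
# On failure, retry every 2 seconds until the socket is available.
Restart=on-failure
RestartSec=2
```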

Testing done:
I created a patch which causes the inotify race to always occur, massively increasing the incidence of the failure case.

After hundreds of instance launches, I have not witnessed a single instance with missing GPU resources (whereas the failure incidence is ~40% on Bottlerocket 1.25.0 with my faulty patch added).

  • basic node readiness tests (a quick spot check is sketched below)
  • cycled over 1000 instance launches without triggering the bug
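As a spot check for the first item (not part of the original test matrix), the device plugin's registration can be confirmed by checking that the node advertises GPU capacity:

```sh
# Verify that nvidia.com/gpu appears in a node's allocatable resources
# (<node-name> is a placeholder).
kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```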

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

cbgbt merged commit 093a732 into bottlerocket-os:develop on Nov 1, 2024 (2 checks passed), and deleted the nkdp-wait-for-kubelet-sock branch on November 1, 2024 at 20:18.
@zestrells commented:
We just encountered the same issue on the newly released AMI. Do we know whether this fixed the issue entirely, or just reduced how often it occurs?

osImage: Bottlerocket OS 1.26.2 (aws-k8s-1.29-nvidia)
