
nvidia-k8s-device-plugin: wait for kubelet.sock before starting #228

Merged (1 commit, Nov 1, 2024)

Conversation

@cbgbt (Contributor) commented on Oct 30, 2024

Issue number:

Closes bottlerocket-os/bottlerocket#4250

Description of changes:
It is possible for nvidia-k8s-device-plugin and kubelet to race, causing graphics nodes to fail to expose GPUs via kubelet.

Specifically, nvidia-k8s-device-plugin starts after kubelet; however, it depends on the kubelet device-plugin management socket being available in order to register itself.

The kubelet service does not synchronize creation of the device-plugin management socket with its systemd "notify" readiness signal, which means that kubelet may report as started before the socket is ready.

If the socket is created after nvidia-k8s-device-plugin begins watching it for inotify events, the creation event may trigger the device plugin's restart logic (the device plugin assumes that kubelet has restarted in this case).
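Purely for illustration (not from the PR itself): the kind of filesystem event the plugin reacts to can be observed from a shell with `inotifywait` from inotify-tools, assuming the default kubelet device-plugin directory:

```sh
# Watch the kubelet device-plugins directory. If kubelet.sock is created after
# this watch is established, a CREATE event fires -- the same class of event
# that sends the device plugin down its restart path.
inotifywait -m -e create /var/lib/kubelet/device-plugins/
```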

Unfortunately, device plugin restarts seem to be somewhat flaky due to issues discussed in bottlerocket-os/bottlerocket#4250.

This change makes the nvidia-k8s-device-plugin unit require kubelet.sock to exist as a socket before starting. If the socket is absent, the unit fails to start and then retries every 2 seconds until the socket is available. An initial sleep is performed first, because kubelet.sock usually does not yet exist at the moment systemd first tries to start nvidia-k8s-device-plugin.
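The merged commit is not reproduced in this conversation, but a minimal sketch of a systemd drop-in with this behavior could look like the following (the socket path assumes the default kubelet device-plugin directory, and the binary paths, sleep, and retry values are illustrative, mirroring the description above):

```ini
# Illustrative drop-in for nvidia-k8s-device-plugin.service; not the exact merged change.
[Unit]
# Retry indefinitely instead of tripping systemd's default start-rate limit.
StartLimitIntervalSec=0

[Service]
# kubelet.sock usually does not exist yet when systemd first starts this unit.
ExecStartPre=/usr/bin/sleep 2
# Fail the unit unless kubelet.sock exists and is a socket (-S).
ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock
# On failure, retry every 2 seconds until the socket is available.
Restart=on-failure
RestartSec=2
```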

Testing done:
I created a patch which causes the inotify race to always occur, massively increasing the incidence of the failure case.

After hundreds of instance launches, I have not witnessed a single instance with missing GPU resources (whereas the failure incidence is ~40% on Bottlerocket 1.25.0 with my faulty patch added).

  • basic node readiness tests (a quick spot check is sketched below)
  • cycled over 1000 instance launches without triggering the bug
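As a spot check for the first item (not part of the original test matrix), the device plugin's registration can be confirmed by checking that the node advertises GPU capacity:

```sh
# Verify that nvidia.com/gpu appears in a node's allocatable resources
# (<node-name> is a placeholder).
kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```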

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

cbgbt merged commit 093a732 into bottlerocket-os:develop on Nov 1, 2024 (2 checks passed), and deleted the nkdp-wait-for-kubelet-sock branch on November 1, 2024 at 20:18.
@zestrells commented:
We just encountered the same issue on the newly released AMI. Do we know whether this fixed the issue entirely, or just reduced how often it occurs?

osImage: Bottlerocket OS 1.26.2 (aws-k8s-1.29-nvidia)
