Use kubelet-device-plugin API #132

Merged: 3 commits merged into bottlerocket-os:develop from nvidia-settings-api on Sep 11, 2024

arnaldo2792 commented Sep 6, 2024

Issue number:

Related: bottlerocket-os/bottlerocket-settings-sdk#60

Description of changes:

Per bottlerocket-os/bottlerocket-settings-sdk#57 (comment), the settings are moving from kubernetes.device-plugin to kubelet-device-plugin.

Testing done:

As part of bottlerocket-os/bottlerocket-settings-sdk#60 and bottlerocket-os/bottlerocket#4182

  1. Instance joined the cluster:
NAME                                           STATUS   ROLES    AGE    VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   22s    v1.30.1-eks-e564799
  2. Files were generated using the new values:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugin": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "pass-device-specs": true
      }
    }
  }
}

bash-5.1# cat /etc/systemd/system/nvidia-k8s-device-plugin.service.d/exec-start.conf
[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-device-plugin --config-file=/etc/nvidia-k8s-device-plugin/settings.yaml
bash-5.1# cat /etc/nvidia-container-runtime/config.toml
### generated from the template file ###
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
bash-5.1#
  3. A container created after the settings were changed has access to all GPUs without requesting any:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: safe-defaults
spec:
  selector:
    matchLabels:
      name: safe-defaults
  template:
    metadata:
      labels:
        name: safe-defaults
    spec:
      # No GPUs requested
      containers:
        - name: safe-defaults
          image: nvidia/cuda:12.4.1-cudnn-devel-rockylinux8
          command: ['sh', '-c', 'sleep infinity']
bash-5.1# apiclient set kubelet-device-plugin.nvidia.device-list-strategy=envvar
bash-5.1# apiclient set nvidia-container-runtime.visible-devices-as-volume-mounts=false
bash-5.1# apiclient set nvidia-container-runtime.visible-devices-envvar-when-unprivileged=true
bash-5.1# cat /etc/nvidia-container-runtime/config.toml
### generated from the template file ###
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
└─> ❯ k exec safe-defaults-cnsqn -- nvidia-smi
Mon Sep  9 15:31:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8               8W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
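
The two configurations exercised in the test steps above pair `device-list-strategy` with specific flags in `/etc/nvidia-container-runtime/config.toml`. As a minimal sketch (not part of this PR), that pairing could be checked from the captured output. The pairing table and the helper names `top_level_bools` and `check_consistency` are hypothetical, and the table is inferred only from the two configurations shown here, not from an official compatibility matrix. The TOML handling is deliberately minimal; on Python 3.11+ `tomllib.loads` would be the more robust choice.

```python
import json

# Sample inputs copied from the test output in this PR.
SETTINGS_JSON = """
{
  "settings": {
    "kubelet-device-plugin": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "pass-device-specs": true
      }
    }
  }
}
"""

CONFIG_TOML = """
### generated from the template file ###
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
"""

# Hypothetical pairing table: an assumption inferred from the two
# configurations tested in this PR, not an official rule.
EXPECTED_RUNTIME_FLAGS = {
    "volume-mounts": {
        "accept-nvidia-visible-devices-as-volume-mounts": True,
        "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
    },
    "envvar": {
        "accept-nvidia-visible-devices-as-volume-mounts": False,
        "accept-nvidia-visible-devices-envvar-when-unprivileged": True,
    },
}


def top_level_bools(config_toml: str) -> dict[str, bool]:
    """Collect `key = true|false` pairs that appear before the first [table]."""
    flags = {}
    for raw in config_toml.splitlines():
        line = raw.strip()
        if line.startswith("["):
            break  # top-level keys end at the first table header
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if value.strip() in ("true", "false"):
            flags[key.strip()] = value.strip() == "true"
    return flags


def check_consistency(settings_json: str, config_toml: str) -> list[str]:
    """Return mismatches between the API settings and the runtime config."""
    nvidia = json.loads(settings_json)["settings"]["kubelet-device-plugin"]["nvidia"]
    strategy = nvidia["device-list-strategy"]
    runtime = top_level_bools(config_toml)

    expected = EXPECTED_RUNTIME_FLAGS.get(strategy)
    if expected is None:
        return [f"no known pairing for device-list-strategy={strategy!r}"]

    return [
        f"{key}: expected {want}, found {runtime.get(key)}"
        for key, want in expected.items()
        if runtime.get(key) != want
    ]


if __name__ == "__main__":
    print(check_consistency(SETTINGS_JSON, CONFIG_TOML) or "consistent")
```

An empty list means the settings returned by `apiclient get settings.kubelet-device-plugin` and the rendered `config.toml` agree under the assumed pairing; any entries name the flag that differs.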

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

git = "https://github.com/bottlerocket-os/bottlerocket-settings-sdk"
tag = "bottlerocket-settings-models-v0.4.0"
version = "0.4.0"
git = "https://github.com/arnaldo2792/bottlerocket-settings-sdk.git"

I'll drop this once the new Settings SDK is released

arnaldo2792 force-pushed the nvidia-settings-api branch 7 times, most recently from aade1bc to ab5cd8e, September 9, 2024 16:00
@arnaldo2792

Forced push includes:

  • Remove unnecessary dependency updates in Cargo.lock
  • Skip CI and wait until the Settings SDK is released

@arnaldo2792

(forced push removes hack commit and uses the latest Settings SDK release)

The API shape was changed to kubelet-device-plugin

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@arnaldo2792

(forced push fixes conflicts and re-enables the CI)

arnaldo2792 marked this pull request as ready for review September 10, 2024 22:19
arnaldo2792 merged commit 557a7e5 into bottlerocket-os:develop Sep 11, 2024
2 checks passed