
/demo/clusters/kind/create-cluster.sh fails with umount: /proc/driver/nvidia: not mounted #811

Closed
mbana opened this issue Jul 9, 2024 · 3 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

mbana commented Jul 9, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04):
$ cat /etc/os-release                                     
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
  • Kernel Version:
$ uname -a       
Linux mbana-1 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker):
$ nvidia-container-runtime --version
NVIDIA Container Runtime version 1.15.0
commit: ddeeca392c7bd8b33d0a66400b77af7a97e16cef
spec: 1.2.0

runc version 1.1.12
commit: v1.1.12-0-g51d5e94
spec: 1.0.2-dev
go: go1.21.11
libseccomp: 2.5.3
$ docker version                                       
Client: Docker Engine - Community
 Version:           26.1.4
 API version:       1.45
 Go version:        go1.21.11
 Git commit:        5650f9b
 Built:             Wed Jun  5 11:28:57 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.4
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       de5c9cf
  Built:            Wed Jun  5 11:28:57 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.33
  GitCommit:        d2d58213f83a351ca8f528a95fbd145f5654e957
 nvidia:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ nvidia-container-cli -V
cli-version: 1.15.0
lib-version: 1.15.0
build date: 2024-04-15T13:36+00:00
build revision: 6c8f1df7fd32cea3280cf2a2c6e931c9b3132465
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS):
$ kind version                                  
kind v0.23.0 go1.22.4 linux/amd64

2. Issue or feature description

The ./demo/clusters/kind/create-cluster.sh script fails:

$ ./demo/clusters/kind/create-cluster.sh
+ set -o pipefail
+ source /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
+++ pwd
++ SCRIPTS_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../../..
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../..
+++ pwd
++ PROJECT_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin
+++ from_versions_mk DRIVER_NAME
+++ local makevar=DRIVER_NAME
++++ grep -E '^\s*DRIVER_NAME\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=DRIVER_NAME := k8s-device-plugin'
+++ echo k8s-device-plugin
++ DRIVER_NAME=k8s-device-plugin
+++ from_versions_mk REGISTRY
+++ local makevar=REGISTRY
++++ grep -E '^\s*REGISTRY\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=REGISTRY ?= nvcr.io/nvidia'
+++ echo nvcr.io/nvidia
++ DRIVER_IMAGE_REGISTRY=nvcr.io/nvidia
+++ from_versions_mk VERSION
+++ local makevar=VERSION
++++ grep -E '^\s*VERSION\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=VERSION ?= v0.16.0-rc.1'
+++ echo v0.16.0-rc.1
++ DRIVER_IMAGE_VERSION=v0.16.0-rc.1
++ : k8s-device-plugin
++ : ubuntu22.04
++ : v0.16.0-rc.1
++ : nvcr.io/nvidia/k8s-device-plugin:v0.16.0-rc.1
++ : v1.29.1
++ : k8s-device-plugin-cluster
++ : /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : kindest/node:v1.29.1
+ /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/build-kind-image.sh
+ set -o pipefail
+ source /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
+++ pwd
++ SCRIPTS_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../../..
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../..
+++ pwd
++ PROJECT_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin
+++ from_versions_mk DRIVER_NAME
+++ local makevar=DRIVER_NAME
++++ grep -E '^\s*DRIVER_NAME\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=DRIVER_NAME := k8s-device-plugin'
+++ echo k8s-device-plugin
++ DRIVER_NAME=k8s-device-plugin
+++ from_versions_mk REGISTRY
+++ local makevar=REGISTRY
++++ grep -E '^\s*REGISTRY\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=REGISTRY ?= nvcr.io/nvidia'
+++ echo nvcr.io/nvidia
++ DRIVER_IMAGE_REGISTRY=nvcr.io/nvidia
+++ from_versions_mk VERSION
+++ local makevar=VERSION
++++ grep -E '^\s*VERSION\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=VERSION ?= v0.16.0-rc.1'
+++ echo v0.16.0-rc.1
++ DRIVER_IMAGE_VERSION=v0.16.0-rc.1
++ : k8s-device-plugin
++ : ubuntu22.04
++ : v0.16.0-rc.1
++ : nvcr.io/nvidia/k8s-device-plugin:v0.16.0-rc.1
++ : v1.29.1
++ : k8s-device-plugin-cluster
++ : /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : kindest/node:v1.29.1
++ docker images --filter reference=kindest/node:v1.29.1 -q
+ EXISTING_IMAGE_ID=171ed79cf912
+ '[' 171ed79cf912 '!=' '' ']'
+ exit 0
+ /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/create-kind-cluster.sh
+ set -o pipefail
+ source /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/common.sh
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
+++ pwd
++ SCRIPTS_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts
++++ dirname -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../../..
+++ cd -- /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/../../../..
+++ pwd
++ PROJECT_DIR=/home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin
+++ from_versions_mk DRIVER_NAME
+++ local makevar=DRIVER_NAME
++++ grep -E '^\s*DRIVER_NAME\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=DRIVER_NAME := k8s-device-plugin'
+++ echo k8s-device-plugin
++ DRIVER_NAME=k8s-device-plugin
+++ from_versions_mk REGISTRY
+++ local makevar=REGISTRY
++++ grep -E '^\s*REGISTRY\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=REGISTRY ?= nvcr.io/nvidia'
+++ echo nvcr.io/nvidia
++ DRIVER_IMAGE_REGISTRY=nvcr.io/nvidia
+++ from_versions_mk VERSION
+++ local makevar=VERSION
++++ grep -E '^\s*VERSION\s+[\?:]= ' /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/versions.mk
+++ local 'value=VERSION ?= v0.16.0-rc.1'
+++ echo v0.16.0-rc.1
++ DRIVER_IMAGE_VERSION=v0.16.0-rc.1
++ : k8s-device-plugin
++ : ubuntu22.04
++ : v0.16.0-rc.1
++ : nvcr.io/nvidia/k8s-device-plugin:v0.16.0-rc.1
++ : v1.29.1
++ : k8s-device-plugin-cluster
++ : /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
++ : kindest/node:v1.29.1
+ kind create cluster --retain --name k8s-device-plugin-cluster --image kindest/node:v1.29.1 --config /home/mbana/dev/coreweave/github/nvidia-k8s-device-plugin/demo/clusters/kind/scripts/kind-cluster-config.yaml
Creating cluster "k8s-device-plugin-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.29.1) 🖼
 ✓ Preparing nodes 📦 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining worker nodes 🚜 
Set kubectl context to "kind-k8s-device-plugin-cluster"
You can now use your cluster with:

kubectl cluster-info --context kind-k8s-device-plugin-cluster

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/
+ docker exec -it k8s-device-plugin-cluster-worker umount -R /proc/driver/nvidia
umount: /proc/driver/nvidia: not mounted
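The failing step is the `docker exec … umount -R /proc/driver/nvidia` the script runs after cluster creation; when nothing is mounted at that path, `umount` exits non-zero and aborts the run. One way to make that step tolerant is to guard it with `mountpoint` (a sketch of a hypothetical helper, not the script's actual code):

```shell
#!/bin/sh
# Sketch: tolerate an already-unmounted /proc/driver/nvidia. `safe_umount`
# is a hypothetical helper, not part of create-cluster.sh. `mountpoint -q`
# exits non-zero when the path is not a mount point (or does not exist).
safe_umount() {
    if mountpoint -q "$1"; then
        umount -R "$1"
    else
        echo "skip: $1 not mounted"
    fi
}

safe_umount /proc/driver/nvidia
```

The script could invoke such a guard inside the worker via `docker exec`, or simply append `|| true` to the existing `umount` call to get the same effect.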

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
$  nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Tue Jul  9 11:21:27 2024
Driver Version                            : 550.90.07
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:05:00.0
    Product Name                          : Quadro RTX 4000
    Product Brand                         : Quadro RTX
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1324121092960
    GPU UUID                              : GPU-fddff5e2-b0d9-3d1e-544a-bc5450cc1092
    Minor Number                          : 0
    VBIOS Version                         : 90.04.87.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x500
    Board Part Number                     : 900-5G160-2550-000
    GPU Part Number                       : 1EB1-850-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G160.0500.00.01
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x05
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x1EB110DE
        Bus Id                            : 00000000:05:00.0
        Sub System Id                     : 0x12A010DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 3
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 30 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 8192 MiB
        Reserved                          : 225 MiB
        Used                              : 1 MiB
        Free                              : 7967 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 3 MiB
        Free                              : 253 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 26 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 97 C
        GPU Slowdown Temp                 : 94 C
        GPU Max Operating Temp            : 92 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 8.75 W
        Current Power Limit               : 125.00 W
        Requested Power Limit             : 125.00 W
        Default Power Limit               : 125.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 125.00 W
    GPU Memory Power Readings 
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : 1005 MHz
        Memory                            : 6501 MHz
    Default Applications Clocks
        Graphics                          : 1005 MHz
        Memory                            : 6501 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 6501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "bip": "192.168.99.1/24",
    "default-shm-size": "1G",
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "1"
    },
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    }
}
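Since a malformed daemon.json silently breaks the runtime selection, one quick sanity check is to confirm the file parses and names `nvidia` as the default runtime (a hedged sketch; the path and key names are taken from the file shown above):

```shell
#!/bin/sh
# Sketch: validate that a Docker daemon.json parses as JSON and names
# "nvidia" as the default runtime. check_daemon_json is a hypothetical
# helper; keys match the daemon.json shown above.
check_daemon_json() {
    python3 -c '
import json, sys
cfg = json.load(open(sys.argv[1]))
assert cfg.get("default-runtime") == "nvidia", "default-runtime is not nvidia"
assert "nvidia" in cfg.get("runtimes", {}), "nvidia runtime not defined"
print("daemon.json OK")
' "$1"
}
```

Usage: `check_daemon_json /etc/docker/daemon.json` before restarting the daemon.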

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
$ env PAGER=cat dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                 Version            Architecture Description
+++-====================================-==================-============-=========================================================
un  libgldispatch0-nvidia                <none>             <none>       (no description available)
ii  libnvidia-cfg1-550:amd64             550.90.07-0ubuntu1 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                   <none>             <none>       (no description available)
un  libnvidia-common                     <none>             <none>       (no description available)
ii  libnvidia-common-550                 550.90.07-0ubuntu1 all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                    <none>             <none>       (no description available)
ii  libnvidia-compute-550:amd64          550.90.07-0ubuntu1 amd64        NVIDIA libcompute package
ii  libnvidia-container-tools            1.15.0-1           amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.15.0-1           amd64        NVIDIA container runtime library
un  libnvidia-decode                     <none>             <none>       (no description available)
ii  libnvidia-decode-550:amd64           550.90.07-0ubuntu1 amd64        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                     <none>             <none>       (no description available)
ii  libnvidia-encode-550:amd64           550.90.07-0ubuntu1 amd64        NVENC Video Encoding runtime library
un  libnvidia-extra                      <none>             <none>       (no description available)
ii  libnvidia-extra-550:amd64            550.90.07-0ubuntu1 amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                       <none>             <none>       (no description available)
ii  libnvidia-fbc1-550:amd64             550.90.07-0ubuntu1 amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                         <none>             <none>       (no description available)
ii  libnvidia-gl-550:amd64               550.90.07-0ubuntu1 amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ml.so.1                    <none>             <none>       (no description available)
un  nvidia-384                           <none>             <none>       (no description available)
un  nvidia-390                           <none>             <none>       (no description available)
un  nvidia-common                        <none>             <none>       (no description available)
un  nvidia-compute-utils                 <none>             <none>       (no description available)
ii  nvidia-compute-utils-550             550.90.07-0ubuntu1 amd64        NVIDIA compute utilities
un  nvidia-container-runtime             <none>             <none>       (no description available)
un  nvidia-container-runtime-hook        <none>             <none>       (no description available)
ii  nvidia-container-toolkit             1.15.0-1           amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base        1.15.0-1           amd64        NVIDIA Container Toolkit Base
ii  nvidia-dkms-550                      550.90.07-0ubuntu1 amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel                   <none>             <none>       (no description available)
un  nvidia-docker                        <none>             <none>       (no description available)
ii  nvidia-docker2                       2.13.0-1           all          nvidia-docker CLI wrapper
ii  nvidia-driver-550                    550.90.07-0ubuntu1 amd64        NVIDIA driver metapackage
un  nvidia-driver-550-open               <none>             <none>       (no description available)
un  nvidia-driver-550-server             <none>             <none>       (no description available)
un  nvidia-driver-550-server-open        <none>             <none>       (no description available)
un  nvidia-driver-binary                 <none>             <none>       (no description available)
ii  nvidia-firmware-550-550.54.15        550.54.15-0ubuntu1 amd64        Firmware files used by the kernel module
ii  nvidia-firmware-550-550.90.07        550.90.07-0ubuntu1 amd64        Firmware files used by the kernel module
un  nvidia-firmware-550-server-550.54.15 <none>             <none>       (no description available)
un  nvidia-firmware-550-server-550.90.07 <none>             <none>       (no description available)
un  nvidia-kernel-common                 <none>             <none>       (no description available)
ii  nvidia-kernel-common-550             550.90.07-0ubuntu1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source                 <none>             <none>       (no description available)
ii  nvidia-kernel-source-550             550.90.07-0ubuntu1 amd64        NVIDIA kernel source package
un  nvidia-opencl-icd                    <none>             <none>       (no description available)
un  nvidia-persistenced                  <none>             <none>       (no description available)
ii  nvidia-prime                         0.8.17.1           all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                      555.42.02-0ubuntu1 amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary               <none>             <none>       (no description available)
un  nvidia-smi                           <none>             <none>       (no description available)
un  nvidia-utils                         <none>             <none>       (no description available)
ii  nvidia-utils-550                     550.90.07-0ubuntu1 amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-550        550.90.07-0ubuntu1 amd64        NVIDIA binary Xorg driver

I am not sure what the script is trying to do, but when I exec into the worker, it reports the correct GPU information:

$ docker exec -it k8s-device-plugin-cluster-worker bash
root@k8s-device-plugin-cluster-worker:/# ls -lah /proc/driver/nvidia
total 0
dr-xr-xr-x 11 root root 0 Jul  9 10:16 .
dr-xr-xr-x  8 root root 0 Jul  9 10:16 ..
dr-xr-xr-x  5 root root 0 Jul  9 10:24 capabilities
dr-xr-xr-x  3 root root 0 Jul  9 10:24 gpus
-r--r--r--  1 root root 0 Jul  9 10:24 params
dr-xr-xr-x  3 root root 0 Jul  9 10:24 patches
-rw-r--r--  1 root root 0 Jul  9 10:24 registry
-rw-r--r--  1 root root 0 Jul  9 10:24 suspend
-rw-r--r--  1 root root 0 Jul  9 10:24 suspend_depth
-r--r--r--  1 root root 0 Jul  9 10:24 version
dr-xr-xr-x  3 root root 0 Jul  9 10:24 warnings
root@k8s-device-plugin-cluster-worker:/# ls -lah /proc/driver/nvidia/gpus
total 0
dr-xr-xr-x  3 root root 0 Jul  9 10:24 .
dr-xr-xr-x 11 root root 0 Jul  9 10:16 ..
dr-xr-xr-x  5 root root 0 Jul  9 10:25 0000:05:00.0
root@k8s-device-plugin-cluster-worker:/# cat /proc/driver/nvidia/gpus/0000\:05\:00.0/information
Model: 		 Quadro RTX 4000
IRQ:   		 64
GPU UUID: 	 GPU-fddff5e2-b0d9-3d1e-544a-bc5450cc1092
Video BIOS: 	 90.04.87.00.01
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:05:00.0
Device Minor: 	 0
GPU Excluded:	 No

mbana commented Jul 9, 2024

This fixed the startup issue for me:

$ cat /etc/nvidia-container-runtime/config.toml  
# We inject all NVIDIA GPUs using the nvidia-container-runtime.
# This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
# in `/etc/nvidia-container-runtime/config.toml`
accept-nvidia-visible-devices-as-volume-mounts = true
...
$ sudo systemctl restart docker && sudo systemctl restart containerd
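Before restarting the daemons, it may help to confirm the option is actually set (a sketch with a hypothetical helper; the key name is taken from the config shown above):

```shell
#!/bin/sh
# Sketch: check whether a nvidia-container-runtime config enables
# volume-mount device injection. volume_mounts_enabled is a hypothetical
# helper; the key name matches the config.toml excerpt above.
volume_mounts_enabled() {
    grep -q '^accept-nvidia-visible-devices-as-volume-mounts = true' "$1"
}

if volume_mounts_enabled /etc/nvidia-container-runtime/config.toml; then
    echo "volume-mount device injection enabled"
fi
```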

However when I attempt to run a GPU workload I get the following error:

$ ./demo/clusters/kind/create-cluster.sh
$ ./demo/clusters/kind/install-plugin.sh
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
$ kubectl logs gpu-pod            
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
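This error usually means the container cannot see a host driver that satisfies the CUDA runtime the image was built against. Here the host driver (550.90.07, CUDA 12.4 per the `nvidia-smi -a` dump above) is more than new enough for the cuda10.2 sample, which suggests the driver libraries were not injected into the kind worker at all rather than a genuine version mismatch. A small helper to pull the driver's supported CUDA version out of that dump for comparison (a sketch; only the "CUDA Version" field name comes from the output above):

```shell
#!/bin/sh
# Extract the "CUDA Version" line from `nvidia-smi -a` output on stdin.
# cuda_version is an illustrative helper; the field name matches the
# nvidia-smi -a dump shown earlier in this issue.
cuda_version() {
    awk -F': *' '/^CUDA Version/ { gsub(/ +$/, "", $2); print $2; exit }'
}

printf 'CUDA Version                              : 12.4\n' | cuda_version
# prints: 12.4
```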


github-actions bot commented Oct 8, 2024

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label on Oct 8, 2024

github-actions bot commented Nov 8, 2024

This issue was automatically closed due to inactivity.

github-actions bot closed this as not planned (stale) on Nov 8, 2024