During cluster bootstrap, the NVIDIA drivers are installed but not loaded, so they are not usable. It appears that a reboot is required before nvidia-smi becomes available. Because the drivers are not loaded, the command below fails (output from cloud-init-output.log):
Status: Downloaded newer image for daskdev/dask:latest
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
If GPUs are being used, the default image should already have the drivers installed and usable. Alternatively, the NVIDIA driver should be loaded after installation without requiring a reboot.
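One way to avoid starting the container before the driver is ready is to poll for it first. The following is a minimal sketch, not part of the project: the helper names `driver_loaded` and `wait_for_driver` are hypothetical, and it assumes that a zero exit status from nvidia-smi is a good enough signal that the driver is loaded. It does not remove the need for a reboot if the driver never loads on its own.

```python
import shutil
import subprocess
import time


def driver_loaded(probe=("nvidia-smi",)):
    """Return True if the probe command exists and exits with status 0.

    Assumption: a successful nvidia-smi run means the driver is loaded.
    """
    if shutil.which(probe[0]) is None:
        return False
    try:
        result = subprocess.run(
            list(probe),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def wait_for_driver(check=driver_loaded, attempts=10, delay=5.0, sleep=time.sleep):
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    if not wait_for_driver():
        raise SystemExit("NVIDIA driver not loaded; a reboot may still be required")
```

A bootstrap script could run this check before the docker command, so that the failure is reported as a missing driver rather than as an OCI runtime error.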
The mandatory --gpus=all flag is also a problem when using Container-Optimized OS (COS). I can run GPU examples in the Ubuntu-based CUDA Docker images by following the instructions at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#e2e, but the --gpus=all flag is not needed, and does not work, when using nvidia-container-runtime.
kwargs needed to make COS work, if the --gpus=all flag was not there.
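To illustrate the request, here is a rough sketch of how the docker command could be assembled with the --gpus=all flag made optional. It is not the project's actual code: `build_docker_command`, `use_gpus_flag`, and `extra_args` are hypothetical names, and the device mapping in the usage example is only an illustrative value, not the kwargs referred to above.

```python
import shlex


def build_docker_command(image, command, use_gpus_flag=True, extra_args=()):
    """Assemble a `docker run` command line as a single shell-safe string.

    use_gpus_flag: hypothetical switch; when False, --gpus=all is omitted.
    extra_args: additional docker arguments supplied by the user.
    """
    parts = ["docker", "run"]
    if use_gpus_flag:
        parts.append("--gpus=all")
    parts.extend(extra_args)
    parts.append(image)
    parts.extend(shlex.split(command))
    return shlex.join(parts)


if __name__ == "__main__":
    # Current behaviour described in this issue: the flag is always present.
    print(build_docker_command("daskdev/dask:latest", "dask-scheduler"))
    # Requested behaviour: drop the flag and pass user-supplied arguments instead.
    print(
        build_docker_command(
            "daskdev/dask:latest",
            "dask-scheduler",
            use_gpus_flag=False,
            extra_args=["--device", "/dev/nvidia0:/dev/nvidia0"],  # illustrative only
        )
    )
```

With a switch like this, COS users could omit the flag and supply their own docker arguments, while the default behaviour stays unchanged for other images.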
Environment: