
Failed to create CUDAExecutionProvider. #17537

Closed
jFkd1 opened this issue Sep 13, 2023 · 13 comments
Labels
ep:CUDA issues related to the CUDA execution provider

Comments


jFkd1 commented Sep 13, 2023

Describe the issue

For some reason, onnxruntime-gpu is having a hard time using CUDAExecutionProvider.

Using CUDA 11.7 with onnxruntime-gpu==1.15.1, I tried running the following in Python 3.10:

import torch
import onnxruntime
save_path = 'model.onnx'
ort_session = onnxruntime.InferenceSession(save_path, providers=['CUDAExecutionProvider'])

And I will always get the error message:

[W:onnxruntime:Default, onnxruntime_pybind_state.cc:640 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.

It looks like I am using the right system and versions. I will upload the ONNX model if necessary.
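As a quick first check of whether the CUDA EP's dependencies are even resolvable, a small stdlib-only probe can attempt to dlopen the libraries a CUDA 11.x build of onnxruntime-gpu typically links against. The soname list below is an assumption based on a CUDA 11.x install; check the requirements page for your exact version:

```python
import ctypes

# Sonames the CUDA EP typically needs on CUDA 11.x (assumed list; verify
# against the requirements page for your onnxruntime-gpu version).
CUDA_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcudnn.so.8",
    "libcurand.so.10",
    "libcufft.so.10",
]

missing = []
for soname in CUDA_LIBS:
    try:
        ctypes.CDLL(soname)  # same lookup mechanism the dynamic loader uses
    except OSError:
        missing.append(soname)

print("missing libraries:", missing)
```

An empty list means the loader can resolve every library; any entry printed here is a likely cause of the `Failed to create CUDAExecutionProvider` warning.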

To reproduce

pip install onnxruntime-gpu==1.15.1

Using CUDA 11.7

Run this code:

import torch
import onnxruntime
save_path = 'model.onnx'
ort_session = onnxruntime.InferenceSession(save_path, providers=['CUDAExecutionProvider'])

Urgency

Very urgent

Platform

Linux

OS Version

Ubuntu 18.04.6 LTS

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.7

@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Sep 13, 2023

jFkd1 commented Sep 13, 2023

tianleiwu (Contributor) commented:

This is not related to the model. It is most likely related to the package installation or the cuDNN installation. Are you able to reproduce it in a fresh Python environment, or after reinstalling as follows?

pip3 uninstall onnxruntime onnxruntime-gpu ort-nightly ort-nightly-gpu
pip3 install onnxruntime-gpu
pip3 install torch torchvision torchaudio --force-reinstall --index-url https://download.pytorch.org/whl/cu117

Then test with your Python script:

import torch
import onnxruntime
save_path = 'model.onnx'
ort_session = onnxruntime.InferenceSession(save_path, providers=['CUDAExecutionProvider'])

Since you import torch before onnxruntime, ORT should be able to use the cuDNN libraries already loaded by torch.
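To confirm which providers the installed package actually exposes, independent of any model, a guarded sketch like the following can help; the ImportError guard is only there so the snippet degrades gracefully on machines without onnxruntime installed:

```python
# Check whether this onnxruntime build exposes the CUDA execution provider.
try:
    import onnxruntime as ort
except ImportError:
    ort = None  # onnxruntime not installed in this environment

if ort is not None:
    available = ort.get_available_providers()
    print("build exposes CUDA EP:", "CUDAExecutionProvider" in available)
else:
    available = []
    print("onnxruntime is not installed")
```

If `CUDAExecutionProvider` is absent from the list, the CPU-only `onnxruntime` package is probably shadowing `onnxruntime-gpu`, which is exactly what the uninstall/reinstall above is meant to fix.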


snnn commented Sep 14, 2023

You can also try our nightly build.

To get it, first uninstall the one you have installed:

python3 -m pip uninstall -y ort-nightly-gpu ort-nightly onnxruntime onnxruntime-gpu -qq

Then install a new one from our nightly feed:

python3 -m pip install coloredlogs flatbuffers numpy packaging protobuf sympy
python3 -m pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ ort-nightly-gpu

Then you can use the ldd tool to figure out which library is missing.


jFkd1 commented Sep 14, 2023

Hi @tianleiwu, thanks for commenting. I tried reinstalling, but it does not seem to help. I have also tried this on a different server (with the exact same CUDA/Python environment) and am still getting the same error.

@snnn, thanks for the suggestion. The nightly build is not working properly either... The only difference I see is that the error output is now colored yellow.

Can you clarify what you mean by using ldd?


snnn commented Sep 14, 2023

In the directory where the package is installed, you should find some *.so files. For example, if you run

find . -name \*.so -exec ldd {} \;

in the directory where the ort-nightly-gpu Python package is installed, the ldd tool should print errors saying that something was not found.
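The same check can be scripted. This hypothetical helper walks a directory, runs ldd on every .so it finds, and collects the lines that report unresolved dependencies:

```python
import pathlib
import subprocess

def unresolved_deps(root: str) -> dict:
    """Map each .so file under root to its 'not found' ldd lines, if any."""
    missing = {}
    for so in pathlib.Path(root).rglob("*.so"):
        result = subprocess.run(["ldd", str(so)], capture_output=True, text=True)
        bad = [line.strip() for line in result.stdout.splitlines()
               if "not found" in line]
        if bad:
            missing[str(so)] = bad
    return missing
```

Running it over the `onnxruntime/capi` directory of the installed package should surface the same failures the manual `find ... -exec ldd` command prints, but filtered down to only the problems.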

tianleiwu (Contributor) commented:

@jFkd1, here is example ldd output for ort-nightly-gpu on Ubuntu 20.04 with CUDA 11.8:

/workspace/anaconda3/envs/sdxl/lib/python3.10/site-packages/onnxruntime/capi$ find  . -name \*.so -exec ldd {} \;
        linux-vdso.so.1 (0x00007ffc8c363000)
        libcublasLt.so.11 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublasLt.so.11 (0x00007fb91f713000)
        libcublas.so.11 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublas.so.11 (0x00007fb919ab5000)
        libcudnn.so.8 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007fb91988f000)
        libcurand.so.10 => /usr/local/cuda/targets/x86_64-linux/lib/libcurand.so.10 (0x00007fb9133f9000)
        libcufft.so.10 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcufft.so.10 (0x00007fb90251e000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb9024fb000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb9024ef000)
        libcudart.so.11.0 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007fb902248000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb901fda000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb901e8b000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb901e66000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb901c74000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb95de91000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb901c6c000)
        linux-vdso.so.1 (0x00007ffca0965000)
        libcudnn.so.8 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007f3703763000)
        libcublas.so.11 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublas.so.11 (0x00007f36fdb05000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f36fdaff000)
        libnvinfer-e0e3f4ae.so.8.6.1 => /workspace/anaconda3/envs/sdxl/lib/python3.10/site-packages/onnxruntime/capi/./../../ort_nightly_gpu.libs/libnvinfer-e0e3f4ae.so.8.6.1 (0x00007f36ef250000)
        libnvinfer_plugin-e94f6113.so.8.6.1 => /workspace/anaconda3/envs/sdxl/lib/python3.10/site-packages/onnxruntime/capi/./../../ort_nightly_gpu.libs/libnvinfer_plugin-e94f6113.so.8.6.1 (0x00007f36ecd64000)
        libnvonnxparser-216e92d7.so.8.6.1 => /workspace/anaconda3/envs/sdxl/lib/python3.10/site-packages/onnxruntime/capi/./../../ort_nightly_gpu.libs/libnvonnxparser-216e92d7.so.8.6.1 (0x00007f36ec8a9000)
        libcudart.so.11.0 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007f36ec602000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f36ec5e6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f36ec5c3000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f36ec5b9000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f36ec34b000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f36ec1fa000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f36ec1d5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f36ebfe3000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3703c58000)
        libcublasLt.so.11 => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcublasLt.so.11 (0x00007f36c7a5d000)
        libcublas-6571bf83.so.12.0.2.224 => /workspace/anaconda3/envs/sdxl/lib/python3.10/site-packages/onnxruntime/capi/./../../ort_nightly_gpu.libs/libcublas-6571bf83.so.12.0.2.224 (0x00007f36c11d0000)
        libcublasLt-5bae62ee.so.12.0.2.224 => /workspace/anaconda3/envs/sdxl/lib/python3.10/site-packages/onnxruntime/capi/./../../ort_nightly_gpu.libs/libcublasLt-5bae62ee.so.12.0.2.224 (0x00007f36a03b8000)
        linux-vdso.so.1 (0x00007ffebf1d3000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6a9ee4f000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6a9ed00000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6a9ecdb000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6a9eae9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6a9f2d8000)
        linux-vdso.so.1 (0x00007ffc83315000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f5a5726d000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f5a57263000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f5a57247000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5a57224000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f5a56fb6000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5a56e67000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f5a56e40000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5a56c4e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f5a5825f000)


snnn commented Sep 14, 2023

It seems totally fine. As tianleiwu said, the `import torch` statement could be the reason. If you load PyTorch before loading onnxruntime, onnxruntime will probably pick up the CUDA and cuDNN libraries from PyTorch instead, and a version mismatch may occur. Can you try to create a simpler script that uses only onnxruntime?
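A minimal torch-free repro might look like the sketch below. The guards around the import and the model path are only there so the snippet can run outside the reporter's environment; `model.onnx` is the file from the issue, not something provided here:

```python
import os

try:
    import onnxruntime as ort
except ImportError:
    ort = None  # not installed here

session_providers = []
if ort is not None and os.path.exists("model.onnx"):
    sess = ort.InferenceSession("model.onnx",
                                providers=["CUDAExecutionProvider"])
    # get_providers() lists the providers actually created; when the CUDA EP
    # fails to load, the session silently falls back to CPUExecutionProvider.
    session_providers = sess.get_providers()
    print(session_providers)
else:
    print("skipped: onnxruntime or model.onnx not available")
```

If the printed list starts with `CPUExecutionProvider` despite requesting CUDA, the EP failed to initialize even though session creation succeeded.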


snnn commented Sep 14, 2023

The latest nightly package has some problems, but the latest release package is fine. I just tested it under RHEL 8 and it worked. The environment had the following packages:

cuda-toolkit-11-config-common-11.8.89-1.noarch
cuda-cudart-11-8-11.8.89-1.x86_64
cuda-nvrtc-11-8-11.8.89-1.x86_64
libnccl-2.15.5-1+cuda11.8.x86_64
libcudnn8-8.9.0.131-1.cuda11.8.x86_64
cuda-toolkit-config-common-12.1.105-1.noarch
cuda-toolkit-11-8-config-common-11.8.89-1.noarch
cuda-compat-11-8-520.61.05-1.x86_64
cuda-libraries-11-8-11.8.0-1.x86_64
cuda-nvtx-11-8-11.8.86-1.x86_64


jFkd1 commented Sep 14, 2023

Thanks for the responses.

I tried not importing torch, but that did not mitigate the issue. I was only importing torch because it was believed to solve the CUDA dependency issue with ORT for some people. Unfortunately, it did not work for me.

Here's what I get from running find . -name \*.so -exec ldd {} \; in my ort-nightly install:

/qitian/miniconda3/envs/roop/lib/python3.10/site-packages/onnxruntime/capi$ find  . -name \*.so -exec ldd {} \;
        linux-vdso.so.1 (0x00007ffff7ffb000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffff6bf4000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ffff69ec000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007ffff67cf000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffff65b0000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ffff6227000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ffff5e89000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ffff5c71000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff5880000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7dd3000)
        linux-vdso.so.1 (0x00007ffff7ffb000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ffff7848000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ffff74aa000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ffff7292000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffff6ea1000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffff7dd3000)
Inconsistency detected by ld.so: dl-version.c: 205: _dl_check_map_versions: Assertion `needed != NULL' failed!
Inconsistency detected by ld.so: dl-version.c: 205: _dl_check_map_versions: Assertion `needed != NULL' failed!

A quick internet search suggests this may be related to #9754. I will try the suggested fix of reinstalling libcudnn8 and see if that helps.


snnn commented Sep 14, 2023

I will make a new nightly package for you to test, one that does not have the #9754 issue. But I will need a few days.

mahesh11T commented:

Following. I am hitting:

FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcurand.so.10: cannot open shared object file: No such file or directory
(/usr/local/lib/python3.8/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so)

Has this issue been solved?


snnn commented Sep 21, 2023

Right, as the message says, your operating system cannot find libcurand.so.10. Did you install CUDA and cuDNN?


snnn commented Sep 21, 2023

Feel free to create a new issue if the problem still exists. Please note that CUDA and cuDNN are commercial software owned by Nvidia, with End User License Agreements. Our team does not redistribute their software due to license restrictions and security concerns; even if we did, there would be no way to prove to you that the redistributed files are genuine. The latest ONNX Runtime release was built with CUDA 11.8 and cuDNN 8.9. Users of ONNX Runtime GPU packages need to get the dependent libraries from Nvidia. If any dependent library is missing, on Linux ONNX Runtime should print a detailed error message when loading its CUDA execution provider, telling you what was missing.

A side note: if you need multiple CUDA versions installed in parallel, it is better not to add any of them to ld.so's global cache (/etc/ld.so.cache); instead, add exactly one of them to the LD_LIBRARY_PATH environment variable.
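For example, a sketch assuming CUDA 11.8 is installed under /usr/local/cuda-11.8 (adjust the prefix to your system):

```shell
# Point the dynamic loader at exactly one CUDA installation, prepending it
# to any existing LD_LIBRARY_PATH without leaving a stray colon.
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```

This scopes the library selection to the current shell session, so different terminals can target different CUDA versions without touching the system-wide cache.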
