
onnxruntime-gpu not working with my gpu / setup #21215

Closed
jillson opened this issue Jun 30, 2024 · 9 comments
Labels: ep:CUDA (issues related to the CUDA execution provider), platform:windows (issues related to the Windows platform)

Comments

jillson commented Jun 30, 2024

Describe the issue

Configuration:
RTX 3050 (Laptop). nvidia-smi reports driver 555.85 and CUDA 12.5, which is slightly confusing since I uninstalled CUDA 12.5 and installed 12.1 based on the stated version compatibility.
cuDNN 8.9.7.29 is installed but doesn't appear to be used.
Using PyTorch (torch 2.1.2+cu121), onnx 1.16.1, and onnxruntime-gpu 1.18.1. (A quick runtime check of what actually gets imported is sketched below.)
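
A minimal way to confirm from Python which onnxruntime build is being imported and which execution providers it was compiled with (a sanity check, assuming the venv above is active):

import onnxruntime as ort

print(ort.__version__)                # expect 1.18.1 here
print(ort.get_device())               # "GPU" for the onnxruntime-gpu package
print(ort.get_available_providers())  # should list "CUDAExecutionProvider"

Note that get_available_providers() lists providers compiled into the package even when their DLLs later fail to load, so a listed CUDAExecutionProvider does not by itself prove the CUDA DLLs are loadable.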

When I try to run the pared-down code (see below), I get an error about not being able to load CUDA:
2024-06-30 15:21:47.1727155 [E:onnxruntime:Default, provider_bridge_ort.cc:1745 onnxruntime::TryGetProviderInfo_CUDA] D:\a\_work\1\s\onnxruntime\core\session\provider_bridge_ort.cc:1426 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "(snip)\venv\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"

To reproduce

Activate my virtual environment (which has the versions listed above), then run:

import os

import onnxruntime as ort
import psutil
import torch  # needed for the device id and compute stream below

providers = [("CUDAExecutionProvider",
              {"device_id": torch.cuda.current_device(),
               "user_compute_stream": str(torch.cuda.current_stream().cuda_stream)})]
sess_options = ort.SessionOptions()

# List every DLL mapped into this process, to see whose
# CUDA/cuDNN libraries actually get loaded.
p = psutil.Process(os.getpid())
for lib in p.memory_maps():
    print(lib.path)

model_path = "./venv/Lib/site-packages/onnx/backend/test/data/node/test_simple_rnn_batchwise/model.onnx"
try:
    sess = ort.InferenceSession(model_path, sess_options=sess_options, providers=providers)
except Exception as e:
    print(e)

I get back the error above and also a list of loaded DLLs; these include torch's cuDNN (and zlib) but not the NVIDIA CUDA/cuDNN files I installed. At least one Stack Overflow post indicated that, with the PyTorch build I'm using, I shouldn't need those since cuDNN is bundled into the wheel. I've tried unsetting the CUDNN/CUDA environment variables and removing them from $PATH, with the same behavior.

Urgency

Very low

Platform

Windows

OS Version

11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.1 (or trying to; may be somehow still using CUDA 12.5)

github-actions bot added the ep:CUDA and platform:windows labels on Jun 30, 2024
jillson (Author) commented Jun 30, 2024

Minor update: https://stackoverflow.com/a/53504578 suggests my nvidia-smi behavior is expected: I do in fact have CUDA 12.1 installed (nvcc --version returns 12.1), but I have the latest (or at least a newer) driver, which supports up to 12.5. Given I'm using pytorch 2.1.2+cu121, I'm going to assume I'm effectively running CUDA 12.1 for purposes here.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
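
For cross-checking, the CUDA version a PyTorch wheel was built against (and whether it can see the GPU at all) can be read at runtime; a minimal sketch:

import torch

print(torch.version.cuda)         # CUDA toolkit torch was built with, e.g. "12.1"
print(torch.cuda.is_available())  # True if the driver and device are usable

nvidia-smi reports the maximum CUDA version the driver supports (12.5 here), while nvcc reports the installed toolkit (12.1); the two can coexist.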

mszhanyi (Contributor) commented Jul 1, 2024

Could you use Dependency Walker to load onnxruntime_providers_cuda.dll and check which dependent DLL is missing?
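
An alternative to Dependency Walker is attempting the load directly from Python with ctypes, which surfaces the same Windows error; a sketch, with the DLL path taken from the error message above (substitute the real venv location):

import ctypes

# Path as reported in the error message; adjust to the actual venv.
dll_path = r"venv\Lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"
try:
    ctypes.WinDLL(dll_path)
    print("loaded OK")
except OSError as e:
    # WinError 126 means this DLL or one of its dependencies
    # (e.g. cudnn64_*.dll, cublas64_*.dll) could not be found.
    print(e)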

tianleiwu (Contributor) commented:

1.18.1 for CUDA 12 requires cuDNN 9.* instead of 8.*. See the release notes: https://github.com/microsoft/onnxruntime/releases/tag/v1.18.1
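
Given the memory map above shows torch's bundled cuDNN being loaded, one way to see which cuDNN version that actually is (the integer encoding is NVIDIA's):

import torch

v = torch.backends.cudnn.version()
print(v)  # e.g. 8907 for cuDNN 8.9.7; cuDNN 9.x builds report values >= 90000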

snnn (Member) commented Jul 1, 2024

And Python does not use the PATH environment variable when searching for DLLs (since Python 3.8 on Windows, dependent DLL directories must be registered explicitly).
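
A minimal sketch of registering a DLL directory explicitly, assuming a default CUDA 12.1 install location (adjust to the local setup):

import os

# Hypothetical install path; adjust to wherever CUDA/cuDNN actually live.
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin")

import onnxruntime as ort  # import only after registering the directories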

jillson (Author) commented Jul 1, 2024

> 1.18.1 for CUDA 12 requires cuDNN 9.* instead of 8.*. See the release notes: https://github.com/microsoft/onnxruntime/releases/tag/v1.18.1

Hmm.... https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements apparently needs to be updated to distinguish 1.18.1 (which, as you note, requires cuDNN 9) from 1.18.0, which worked with cuDNN 8.9. Currently trying to download torch's nightly, which I'm hoping will get me cuDNN 9 (my attempt to overwrite torch's "vendored" 8.x DLLs with the 9.x DLLs I had downloaded went about as well as you'd expect)... if that doesn't work, I'll likely roll back to the 1.18.0 binary and see if that gets things aligned.

Thanks for the reminder that Python doesn't use PATH for DLL resolution.

jillson (Author) commented Jul 2, 2024

Hmm... now getting OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\jills\git\stable-diffusion-webui\venv\lib\site-packages\torch\lib\fbgemm.dll" or one of its dependencies. This happens both with the latest nightly PyTorch (which does ship cuDNN 9 DLLs) and, in a different venv, after reverting to onnxruntime-gpu==1.18.0... at this point, the best thing to do is likely to reinstall the venv, but I'm going to wait until next week due to much slower internet this week.

jillson closed this as completed on Jul 2, 2024
NulliferBones commented:

I'm also unable to use CUDA for ReActor. I'm receiving the same error as the OP.

jillson (Author) commented Jul 10, 2024

Switching to a torch build with cuDNN 8.9 and onnxruntime-gpu==1.18.0, my simple example now works... but I'm still getting FAIL : LoadLibrary failed with error 126 when trying to load onnxruntime_providers_cuda.dll in stable diffusion, which is what I actually care about...
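
One way to confirm the session really kept the CUDA provider instead of silently falling back to CPU, continuing the repro above (model_path, sess_options, and providers as defined there):

sess = ort.InferenceSession(model_path, sess_options=sess_options, providers=providers)
print(sess.get_providers())  # ["CPUExecutionProvider"] alone means CUDA fell back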

jillson (Author) commented Jul 10, 2024

And looking more closely at the error, somehow my virtualenv has gotten mangled under Git Bash, leading it to find (stale) DLLs in the global Python install over the ones in the venv; switching to PowerShell/cmd seems to finally use CUDA for stable diffusion (or at least it doesn't throw errors and fall back to CPU)... but I'm still seeing it take way longer than I'd like to run, AND perfmon indicates 0% GPU utilization... sigh.
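
A stale-venv mixup like this can be confirmed from inside the interpreter; a small sketch, nothing specific to stable-diffusion-webui:

import sys
import onnxruntime

print(sys.executable)        # should point into the venv, not global Python
print(onnxruntime.__file__)  # should resolve under the venv's site-packages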
