[Training] Error when import OnnxRuntime-Training #18171
Comments
Can you try ldd libonnxruntime_providers_cuda.so? This may give some hints. @snnn: do you have any suggestions?
@askhade I can now successfully execute my Python script. However, in order to execute it, I first had to export some variables:
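(The exact variables are not shown here. For reference, a hypothetical set of exports for this kind of missing libcublasLt.so.11 error would look roughly like the following; the paths are assumptions based on the CUDA home mentioned elsewhere in this issue.)

```bash
# Hypothetical example: the variables the author actually exported are not
# shown in this thread, and these paths are assumptions.
export CUDA_HOME=/usr/local/cuda-11.2
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
```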
Now I can train a model; however, in the validation/test phase after the training code, I now get an error:
The part of my code that throws the error is the following:
I don't understand exactly why this happens. I guess it is due to the batch size, but when I set up the datasets I chose the same batch size for the training dataset and for the validation/test dataset, and it only fails with the validation/test dataset.
Do you have any thoughts about why this error happens, or about how num_workers and batch_size affect the test buffer?
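For what it's worth, one common way a size mismatch shows up only during validation is when the dataset length is not a multiple of the batch size, so the final batch comes out smaller. A minimal sketch, assuming a PyTorch-style DataLoader (the sizes and names below are made up for illustration, not taken from the code above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 103 samples is deliberately not a multiple of the batch size of 32.
dataset = TensorDataset(torch.randn(103, 8))

loader = DataLoader(dataset, batch_size=32, num_workers=0, drop_last=False)
print([batch[0].shape[0] for batch in loader])  # [32, 32, 32, 7]: the last batch is short

loader = DataLoader(dataset, batch_size=32, num_workers=0, drop_last=True)
print([batch[0].shape[0] for batch in loader])  # [32, 32, 32]: the short batch is dropped
```

If the validation code writes into a buffer sized for a full batch, that short final batch would fail there even if the training loader (for example, with drop_last=True) never produces one.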
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.
Describe the issue
I am using onnxruntime-training built from source. I compiled and successfully created the Python wheel for CUDA 11.2 and installed it with pip. I achieved this on the source machine where I created the wheel and on another machine with the same CUDA version and configuration (in this case, without building onnxruntime again, just copying the wheel files to the other machine and installing them with pip). I also configured the new machine with the same CUDA_HOME and LD_LIBRARY_PATH variables.
However, when I try to execute my Python script on the new machine, it always fails because it cannot find the libcublasLt.so.11 file:
Moreover, a message appears saying that no CUDA runtime was found, yet it also detects the right CUDA home path, whose lib64 directory (/usr/local/cuda11.2/lib64) contains the libcublasLt.so.11 file.
I don't quite understand how it detects the CUDA home path correctly but does not find the file inside it, because I also tried executing export CUDA_HOME=/usr/local/cuda-11.2/lib64 before running the script, and it failed again.
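(In case it helps with debugging, one way to see which cuBLAS libraries the CUDA execution provider actually resolves is to run ldd against the copy shipped inside the installed wheel. This is only a sketch; the capi/ location inside the package is an assumption and may differ in your install.)

```bash
# Hypothetical check: the exact location of the provider library inside the
# installed wheel is an assumption; adjust the path to your environment.
SITE=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
ldd "$SITE/onnxruntime/capi/libonnxruntime_providers_cuda.so" | grep -i cublas
```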
Any help would be greatly appreciated.
To reproduce
Build onnxruntime-training from source with CUDA 11.2, CMake 3.27, GCC 9.5, and Python 3.8.
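The exact build command is not recorded in this issue; for reference, a typical onnxruntime-training CUDA build along these lines would look roughly like the following (the CUDA and cuDNN paths are assumptions):

```bash
# Approximate reconstruction: the exact flags used are not given in this issue,
# and the CUDA/cuDNN paths are assumptions.
./build.sh --config Release \
           --enable_training \
           --use_cuda \
           --cuda_home /usr/local/cuda-11.2 \
           --cudnn_home /usr/local/cuda-11.2 \
           --build_wheel \
           --parallel
```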
Urgency
No response
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
onnxruntime-training 1.17.0+cu112
PyTorch Version
None
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.2