
[Training] Error when importing OnnxRuntime-Training #18171

Closed
IzanCatalan opened this issue Oct 30, 2023 · 4 comments
Labels
ep:CUDA: issues related to the CUDA execution provider
stale: issues that have not been addressed in a while; categorized by a bot
training: issues related to ONNX Runtime training; typically submitted using template

Comments

@IzanCatalan

Describe the issue

I am using onnxruntime-training built from source. I compiled and successfully created the Python wheel for CUDA 11.2 and installed it with pip. I achieved this both on the source machine where I created the wheel and on another machine with the same CUDA version and configuration (in the latter case without rebuilding onnxruntime: I just copied the wheel files to the other machine and installed them with pip). I also configured the new machine with the same CUDA_HOME and LD_LIBRARY_PATH variables.
However, when I try to execute my Python script on the new machine, it always fails because it cannot find the libcublasLt.so.11 file:

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.2/'
/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/utils/hooks/_zero_offload_subscriber.py:118: UserWarning: DeepSpeed import error No module named 'deepspeed'
  warnings.warn(f"DeepSpeed import error {e}")
STARTING
RUNNING
DATASET LOADED
Traceback (most recent call last):
  File "trainCMTS.py", line 435, in <module>
    model = orttraining.Module(
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 60, in __init__
    self._model = C.Module(
RuntimeError: /home/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

Moreover, a message saying that no CUDA runtime is found appears, yet it also detects the right CUDA home path, and the libcublasLt.so.11 file is inside it (/usr/local/cuda-11.2/lib64).

I don't quite understand how it can detect the CUDA home path correctly and still not find the file inside it. I also tried executing export CUDA_HOME=/usr/local/cuda-11.2/lib64 before running the script, and it failed again.
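The confusion above is consistent with how the dynamic loader works: at runtime, shared objects such as libcublasLt.so.11 are resolved through LD_LIBRARY_PATH and the ldconfig cache, not through CUDA_HOME. A minimal sketch of that search (find_shared_lib is a hypothetical helper, not part of onnxruntime):

```python
import os

# Hypothetical helper (not part of onnxruntime): replicate the dynamic
# loader's LD_LIBRARY_PATH search to see whether a library such as
# libcublasLt.so.11 is actually reachable. The loader does not consult
# CUDA_HOME, which is why setting CUDA_HOME alone does not help here.
def find_shared_lib(name, env=os.environ):
    """Return the first LD_LIBRARY_PATH entry containing name, or None."""
    for d in env.get("LD_LIBRARY_PATH", "").split(":"):
        candidate = os.path.join(d, name)
        if d and os.path.isfile(candidate):
            return candidate
    return None
```

If this returns None for libcublasLt.so.11, the fix is to add /usr/local/cuda-11.2/lib64 to LD_LIBRARY_PATH rather than to CUDA_HOME.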

Any help would be much appreciated.

To reproduce

Build onnxruntime-training from source with CUDA 11.2, CMake 3.27, GCC 9.5, and Python 3.8.

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-training 1.17.0+cu112

PyTorch Version

None

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

@IzanCatalan IzanCatalan added the training issues related to ONNX Runtime training; typically submitted using template label Oct 30, 2023
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Oct 30, 2023
@askhade
Contributor

askhade commented Oct 30, 2023

Can you try ldd libonnxruntime_providers_cuda.so? This may give some hints.
Also, please make sure all your paths are set correctly: check that libcublasLt.so.11 is indeed in the location you expect, and add that location to LD_LIBRARY_PATH as well.

@snnn: Do you have any suggestions?
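The checks suggested above can be sketched as follows. CUDA_DIR and the provider-library location are assumptions based on the paths in this thread; adjust them to your machine.

```shell
CUDA_DIR=/usr/local/cuda-11.2/lib64

# 1. Confirm libcublasLt.so.11 really is where you expect it.
[ -e "$CUDA_DIR/libcublasLt.so.11" ] && echo "libcublasLt.so.11 present" || echo "libcublasLt.so.11 missing"

# 2. The dynamic loader searches LD_LIBRARY_PATH (and the ldconfig
#    cache), not CUDA_HOME, so the directory must be added explicitly.
export LD_LIBRARY_PATH="$CUDA_DIR:$LD_LIBRARY_PATH"

# 3. ldd lists every shared object the provider library links against;
#    any entry marked "not found" names a missing dependency.
PROVIDER=$(python -c 'import onnxruntime, os; print(os.path.join(os.path.dirname(onnxruntime.__file__), "capi", "libonnxruntime_providers_cuda.so"))' 2>/dev/null)
{ [ -n "$PROVIDER" ] && ldd "$PROVIDER" | grep "not found"; } || true
```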

@IzanCatalan
Author

@askhade I can now successfully execute my Python script. However, to execute it I first had to export some variables:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64:/mnt/beegfs/gap/izcagal/glibc229/lib:/usr/local/cuda/targets/x86_64-linux/lib:/usr/lib/x86_64-linux-gnu
export PATH=/usr/local/cuda-11.2/bin:$PATH
export CUDA_HOME=/usr/local/cuda-11.2

Now I can train a model. However, in the test/validation phase that runs after the training code, I now get an error:

  File "trainCMTS.py", line 383, in test
    test_loss, logits = model(*forward_inputs)
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 113, in __call__
    return _take_generic_step([*user_inputs])
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 87, in _take_generic_step
    self._model.eval_step(forward_inputs, fetches)
RuntimeError: /home/onnxruntime/orttraining/orttraining/training_api/module.cc:553 onnxruntime::common::Status onnxruntime::training::api::Module::EvalStep(const std::vector<OrtValue>&, std::vector<OrtValue>&) [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'resnetv17_conv0_fwd' Status Message: /home/onnxruntime/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 822083584

The part of my code that throws the error is the following:

    model.eval()
    print("setup eval")
    losses = []
    metric = evaluate.load('accuracy')

    for _, (data, target) in enumerate(val_loader):
        forward_inputs = [data.reshape(len(data), 3, 224, 224).numpy().astype(np.float32),
                          target.numpy().astype(np.int64)]
        test_loss, logits = model(*forward_inputs)  # <- error raised here
        metric.add_batch(references=target, predictions=get_pred(logits))
        losses.append(test_loss)
I don't understand exactly why this happens. I guess it is due to the batch size, but when I set up the datasets I chose the same batch size for the training set and the validation/test set, and it only fails with the validation/test set.

imagenet_data_train = datasets.ImageNet('/mnt/beegfs/gap/izcagal/docker/Imagenet', split="train", transform=transform_test)
imagenet_data_val = datasets.ImageNet('/mnt/beegfs/gap/izcagal/docker/Imagenet', split="val", transform=transform_test)
train_loader = torch.utils.data.DataLoader(imagenet_data_train,
                                           batch_size=256,
                                           shuffle=True,
                                           num_workers=16)

val_loader = torch.utils.data.DataLoader(imagenet_data_val,
                                         batch_size=256,
                                         shuffle=True,
                                         num_workers=16)

Do you have any thoughts on why this error happens, or on how num_workers and batch_size affect memory use in the test phase?
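The allocation failure above (a request for roughly 822 MB) typically means the GPU ran out of free memory during eval, since the training graph, optimizer state, and cached activations are still resident. A common workaround, sketched below with stand-in data rather than the actual ImageNet pipeline, is to give evaluation a smaller batch size than training:

```python
import numpy as np

# Hypothetical sketch, not the author's pipeline: feed evaluation in
# smaller batches than training so its activations fit in whatever
# GPU memory the resident training state leaves free.
def batches(data, batch_size):
    """Yield successive batch_size-sized slices of data."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# Stand-in for the (N, 3, 224, 224) ImageNet tensors in the thread.
val_data = np.zeros((512, 3, 32, 32), dtype=np.float32)

train_batch = 256   # batch size used for training
eval_batch = 64     # deliberately smaller for evaluation
sizes = [len(b) for b in batches(val_data, eval_batch)]
print(sizes)        # [64, 64, 64, 64, 64, 64, 64, 64]
```

In the DataLoader setup above this amounts to passing a smaller batch_size (and shuffle=False, since evaluation does not need shuffling) to val_loader only; if the error persists, halve it again until the eval step fits.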

Contributor

github-actions bot commented Dec 1, 2023

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Dec 1, 2023
Contributor

github-actions bot commented Jan 1, 2024

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 1, 2024