
[Training] Error when importing OnnxRuntime-Training #18171

Closed
IzanCatalan opened this issue Oct 30, 2023 · 4 comments
Labels
ep:CUDA: issues related to the CUDA execution provider
stale: issues that have not been addressed in a while; categorized by a bot
training: issues related to ONNX Runtime training; typically submitted using template

Comments

@IzanCatalan

Describe the issue

I am using onnxruntime-training built from source. I compiled and successfully created the Python wheel for CUDA 11.2 and installed it with pip. I achieved this both on the source machine where I created the wheel and on another machine with the same CUDA version and configuration (in the latter case without rebuilding onnxruntime: I just copied the wheel files to the other machine and installed them with pip). I also configured the new machine with the same CUDA_HOME and LD_LIBRARY_PATH variables.
However, when I try to execute my Python script on the new machine, it always fails because it cannot find the libcublasLt.so.11 file:

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.2/'
/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/utils/hooks/_zero_offload_subscriber.py:118: UserWarning: DeepSpeed import error No module named 'deepspeed'
  warnings.warn(f"DeepSpeed import error {e}")
STARTING
RUNNING
DATASET LOADED
Traceback (most recent call last):
  File "trainCMTS.py", line 435, in <module>
    model = orttraining.Module(
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 60, in __init__
    self._model = C.Module(
RuntimeError: /home/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1193 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory

Moreover, a message saying that no CUDA runtime is found appears, yet it also detects the right CUDA home path, and the libcublasLt.so.11 file is inside it (/usr/local/cuda-11.2/lib64).

I don't quite understand how it can detect the CUDA home path correctly and still not find the file inside it. I also tried executing export CUDA_HOME=/usr/local/cuda-11.2/lib64 before running the script, and it failed again.
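The confusion above is consistent with how the dynamic loader works: at runtime, shared objects such as libcublasLt.so.11 are resolved through LD_LIBRARY_PATH and the ldconfig cache, not through CUDA_HOME. A minimal sketch of that search (find_shared_lib is a hypothetical helper, not part of onnxruntime):

```python
import os

# Hypothetical helper (not part of onnxruntime): replicate the dynamic
# loader's LD_LIBRARY_PATH search to see whether a library such as
# libcublasLt.so.11 is actually reachable. The loader does not consult
# CUDA_HOME, which is why setting CUDA_HOME alone does not help here.
def find_shared_lib(name, env=os.environ):
    """Return the first LD_LIBRARY_PATH entry containing name, or None."""
    for d in env.get("LD_LIBRARY_PATH", "").split(":"):
        candidate = os.path.join(d, name)
        if d and os.path.isfile(candidate):
            return candidate
    return None
```

If this returns None for libcublasLt.so.11, the fix is to add /usr/local/cuda-11.2/lib64 to LD_LIBRARY_PATH rather than to CUDA_HOME.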

Any help would be much appreciated.

To reproduce

Build onnxruntime-training from source with CUDA 11.2, CMake 3.27, GCC 9.5, and Python 3.8.

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-training 1.17.0+cu112

PyTorch Version

None

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.2

@IzanCatalan IzanCatalan added the training issues related to ONNX Runtime training; typically submitted using template label Oct 30, 2023
@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Oct 30, 2023
@askhade
Contributor

askhade commented Oct 30, 2023

Can you try ldd libonnxruntime_providers_cuda.so? This may give some hints.
Also, please make sure all your paths are set correctly: check that libcublasLt.so.11 is indeed in the location you expect, and add that location to LD_LIBRARY_PATH as well.

@snnn: Do you have any suggestions?
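The checks suggested above can be sketched as follows. CUDA_DIR and the provider-library location are assumptions based on the paths in this thread; adjust them to your machine.

```shell
CUDA_DIR=/usr/local/cuda-11.2/lib64

# 1. Confirm libcublasLt.so.11 really is where you expect it.
[ -e "$CUDA_DIR/libcublasLt.so.11" ] && echo "libcublasLt.so.11 present" || echo "libcublasLt.so.11 missing"

# 2. The dynamic loader searches LD_LIBRARY_PATH (and the ldconfig
#    cache), not CUDA_HOME, so the directory must be added explicitly.
export LD_LIBRARY_PATH="$CUDA_DIR:$LD_LIBRARY_PATH"

# 3. ldd lists every shared object the provider library links against;
#    any entry marked "not found" names a missing dependency.
PROVIDER=$(python -c 'import onnxruntime, os; print(os.path.join(os.path.dirname(onnxruntime.__file__), "capi", "libonnxruntime_providers_cuda.so"))' 2>/dev/null)
{ [ -n "$PROVIDER" ] && ldd "$PROVIDER" | grep "not found"; } || true
```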

@IzanCatalan
Author

@askhade I can now successfully execute my Python script. However, to execute it I first had to export some variables:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64:/mnt/beegfs/gap/izcagal/glibc229/lib:/usr/local/cuda/targets/x86_64-linux/lib:/usr/lib/x86_64-linux-gnu
export PATH=/usr/local/cuda-11.2/bin:$PATH
export CUDA_HOME=/usr/local/cuda-11.2

Now I can train a model. However, in the test/validation phase that runs after the training code, I now get an error:

  File "trainCMTS.py", line 383, in test
    test_loss, logits = model(*forward_inputs)
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 113, in __call__
    return _take_generic_step([*user_inputs])
  File "/mnt/beegfs/gap/izcagal/.conda/envs/onnx/lib/python3.8/site-packages/onnxruntime/training/api/module.py", line 87, in _take_generic_step
    self._model.eval_step(forward_inputs, fetches)
RuntimeError: /home/onnxruntime/orttraining/orttraining/training_api/module.cc:553 onnxruntime::common::Status onnxruntime::training::api::Module::EvalStep(const std::vector<OrtValue>&, std::vector<OrtValue>&) [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'resnetv17_conv0_fwd' Status Message: /home/onnxruntime/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 822083584

The part of my code that throws the error is the following:

    model.eval()
    print("setup eval")
    losses = []
    metric = evaluate.load('accuracy')

    for _, (data, target) in enumerate(val_loader):
        forward_inputs = [data.reshape(len(data), 3, 224, 224).numpy().astype(np.float32),
                          target.numpy().astype(np.int64)]
        test_loss, logits = model(*forward_inputs)  # <- error raised here
        metric.add_batch(references=target, predictions=get_pred(logits))
        losses.append(test_loss)
I don't understand exactly why this happens. I guess it is due to the batch size, but when I set up the datasets I chose the same batch size for the training set and the validation/test set, and it only fails with the validation/test set.

imagenet_data_train = datasets.ImageNet('/mnt/beegfs/gap/izcagal/docker/Imagenet', split="train", transform=transform_test)
imagenet_data_val = datasets.ImageNet('/mnt/beegfs/gap/izcagal/docker/Imagenet', split="val", transform=transform_test)
train_loader = torch.utils.data.DataLoader(imagenet_data_train,
                                           batch_size=256,
                                           shuffle=True,
                                           num_workers=16)

val_loader = torch.utils.data.DataLoader(imagenet_data_val,
                                         batch_size=256,
                                         shuffle=True,
                                         num_workers=16)

Do you have any thoughts on why this error happens, or on how num_workers and batch_size affect memory use in the test phase?
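The allocation failure above (a request for roughly 822 MB) typically means the GPU ran out of free memory during eval, since the training graph, optimizer state, and cached activations are still resident. A common workaround, sketched below with stand-in data rather than the actual ImageNet pipeline, is to give evaluation a smaller batch size than training:

```python
import numpy as np

# Hypothetical sketch, not the author's pipeline: feed evaluation in
# smaller batches than training so its activations fit in whatever
# GPU memory the resident training state leaves free.
def batches(data, batch_size):
    """Yield successive batch_size-sized slices of data."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# Stand-in for the (N, 3, 224, 224) ImageNet tensors in the thread.
val_data = np.zeros((512, 3, 32, 32), dtype=np.float32)

train_batch = 256   # batch size used for training
eval_batch = 64     # deliberately smaller for evaluation
sizes = [len(b) for b in batches(val_data, eval_batch)]
print(sizes)        # [64, 64, 64, 64, 64, 64, 64, 64]
```

In the DataLoader setup above this amounts to passing a smaller batch_size (and shuffle=False, since evaluation does not need shuffling) to val_loader only; if the error persists, halve it again until the eval step fits.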

Contributor

github-actions bot commented Dec 1, 2023

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Dec 1, 2023
Contributor

github-actions bot commented Jan 1, 2024

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 1, 2024