
[Error] [ONNXRuntimeError] : 1 : FAIL : CUDA failure 3: initialization error #21368

Closed
phamkhactu opened this issue Jul 16, 2024 · 4 comments
Labels
ep:CUDA issues related to the CUDA execution provider

Comments

phamkhactu commented Jul 16, 2024

Describe the issue

I tested my model successfully on my local machine with the same torch version. However, I get an error when I load the model in a Docker image.

  File "/opt/conda/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : CUDA failure 3: initialization error ; GPU=32764 ; hostname=dcd468bff115 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=388 ; expr=cudaSetDevice(GetDeviceId()); 

To reproduce

Docker image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
onnxruntime-gpu: onnxruntime-gpu==1.15

Following the docs provided, I think my setup is correct.
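
As a quick sanity check inside the container, the sketch below confirms that ONNX Runtime sees the GPU at all, before any session is created:

import onnxruntime as ort

print(ort.get_device())               # expected: "GPU" for onnxruntime-gpu
print(ort.get_available_providers())  # should include "CUDAExecutionProvider"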

Urgency

No response

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.15

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

Cuda 11.8

github-actions bot added the ep:CUDA label Jul 16, 2024
tianleiwu (Contributor) commented Jul 17, 2024

I could not reproduce the issue. Here is what I tried:

docker run --rm -it --gpus all -v $PWD:/workspace pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel /bin/bash

Then, run the following command line in docker

pip install onnxruntime-gpu==1.15

Finally, test an onnx model using python:

import torch  # If you do not import torch, you will need to install cuDNN 8.* and set the path properly.
import onnxruntime
session = onnxruntime.InferenceSession("matmul_1.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
session.get_providers()
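
(matmul_1.onnx above is just a small test model; a minimal sketch to generate a comparable one with torch, where the class name and shapes are arbitrary:)

import torch

class MatMul(torch.nn.Module):
    def forward(self, x, y):
        return torch.matmul(x, y)

torch.onnx.export(MatMul(), (torch.randn(2, 3), torch.randn(3, 4)), "matmul_1.onnx")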

Everything seems good.

phamkhactu (Author) commented:

Okay, I will check it again. Thank you.

phamkhactu (Author) commented:

Hi @tianleiwu,

After debugging many times on my local machine and in the Docker container, with the same code in both environments, I saw that:

  • The model initializes successfully on my local machine
  • It fails in the Docker container

Here is what I found:
I have an init.py where I initialize my global model, and app.py is a FastAPI app served with gunicorn. The error occurs every time I call /tts, inside the get_audio function:

import logging
import tempfile
import threading

import numpy as np
import torch
from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse
from pydantic import BaseModel
from scipy.io import wavfile

import helpers  # project module providing clean_text
import init     # project module; creates the global model at import time

logger = logging.getLogger(__name__)
app = FastAPI()


class Text2SpeechItem(BaseModel):
    # Fields inferred from how the handler reads the request body.
    text: str
    speaker_id: int


def get_audio(sentences, speaker_id):
    wavs = []
    for sent in sentences:
        if sent.strip() == "":
            continue
        if sent == "########":  # sentinel token marking a longer pause
            silence_duration = int(0.15 * 22050)  # 0.15 s at 22050 Hz
            silence = np.zeros(silence_duration)
            wavs.append(silence)
            continue
        wav = init.model(sent, speaker_id)
        wavs.append(wav)
        silence_duration = int(0.1 * 22050)  # short gap between sentences
        silence = np.zeros(silence_duration)
        wavs.append(silence)

    wav = np.concatenate(tuple(wavs))
    return wav


@app.post("/tts")
async def tts(item: Text2SpeechItem):
    text = item.text
    speaker_id = item.speaker_id
    if text == "":
        raise HTTPException(status_code=400, detail="empty text")
    # torch.cuda.empty_cache()
    sentences = helpers.clean_text(text)
    logger.info("*" * 50)
    logger.info(sentences)
    logger.info("*" * 50)
    wav = get_audio(sentences, speaker_id=speaker_id)

    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        wavfile.write(tmp.name, rate=22050, data=wav.astype(np.int16))
        torch.cuda.empty_cache()
        # Note: the payload is WAV data, so the media type should match.
        return FileResponse(tmp.name, media_type="audio/wav", filename=tmp.name)


# uvicorn.run(app, host="0.0.0.0", port=6688)
t = threading.Thread(target=setting_env)  # setting_env is defined elsewhere in the project
t.start()

What do you think this issue comes from? Do you have any suggestions?
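
One likely reading, given that cudaSetDevice fails with an initialization error and the device id in the log is nonsensical (GPU=32764): a CUDA context created before gunicorn forks its worker processes is not usable in the children, because CUDA contexts do not survive fork(). If init.py builds the ONNX Runtime session at import time and gunicorn loads the app in the master process (for example with --preload), every worker inherits a dead context. A common workaround is to create the model in a startup hook, which runs inside each worker after the fork. A minimal sketch, where load_model and model.onnx are hypothetical stand-ins for the construction currently in init.py:

import onnxruntime
from fastapi import FastAPI

app = FastAPI()
model = None  # populated per worker in the startup hook below


def load_model():
    # Hypothetical stand-in for the model construction currently in init.py.
    return onnxruntime.InferenceSession(
        "model.onnx",  # placeholder path
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )


@app.on_event("startup")
def load_model_in_worker():
    # Runs once per worker process, after gunicorn has forked, so the CUDA
    # context is created in the process that actually runs inference.
    global model
    model = load_model()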

phamkhactu (Author) commented:

I spent more time and tried other serving setups. After that, I found that uvicorn works well.
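
That is consistent with the fork explanation above: uvicorn's default mode serves the app from a single process, so no CUDA context ever crosses a fork(). For reference, a sketch of the two launch commands, with the module name app and port 6688 taken from the commented-out uvicorn.run line; if gunicorn is kept as the process manager, avoiding --preload lets each worker import the app, and thus create its CUDA context, after the fork:

uvicorn app:app --host 0.0.0.0 --port 6688
gunicorn app:app -k uvicorn.workers.UvicornWorker --workers 1 --bind 0.0.0.0:6688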
