[Performance] First inference with CUDAExecutionProvider is slow #21541
Labels
ep:CUDA (issues related to the CUDA execution provider)
performance (issues related to performance regressions)
Describe the issue
I am running inference on an image with onnxruntime-gpu on a Colab T4 instance.
The first inference is much slower than the following ones (about 12 seconds), probably because of model loading and initialization.
If I run inference again, it is very fast (close to 0 seconds).
With CPUExecutionProvider, the first inference is not as slow as the first CUDA run (about 4 seconds).
Is there a way to speed up the first inference?
Thanks
To reproduce
Here is my Colab code:
!pip install -U torch torchvision torchaudio
!pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
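The inference step itself is roughly the following; this is a minimal timing sketch, and the model path, input name, and input shape are placeholders to adjust to the actual model:

import time
import numpy as np
import onnxruntime as ort

# Placeholder model file and input shape; replace with the actual model.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

# First run: about 12 s with CUDAExecutionProvider (includes initialization).
start = time.time()
session.run(None, {input_name: image})
print("first run:", time.time() - start, "s")

# Second run: close to 0 s.
start = time.time()
session.run(None, {input_name: image})
print("second run:", time.time() - start, "s")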
Urgency
No response
Platform
Linux
OS Version
Ubuntu
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.2
Model File
No response
Is this a quantized model?
Unknown