[Performance] First inference with CUDAExecutionProvider is slow #21541
Labels
ep:CUDA (issues related to the CUDA execution provider)
performance (issues related to performance regressions)
Describe the issue
I am running inference on an image with onnxruntime-gpu on a Colab T4 instance.
The first inference is much slower than the following ones (about 12 seconds), probably because of model loading and initialization.
If I run inference again, it is very fast (close to 0 seconds).
With CPUExecutionProvider, the first inference is not as slow as the first CUDA run (about 4 seconds).
Is there a way to speed up the first inference?
Thanks
To reproduce
Here is my Colab code:
!pip install -U torch torchvision torchaudio
!pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
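The inference step itself is roughly the following; this is a minimal timing sketch, and the model path, input name, and input shape are placeholders to adjust to the actual model:

import time
import numpy as np
import onnxruntime as ort

# Placeholder model file and input shape; replace with the actual model.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
image = np.random.rand(1, 3, 224, 224).astype(np.float32)

# First run: about 12 s with CUDAExecutionProvider (includes initialization).
start = time.time()
session.run(None, {input_name: image})
print("first run:", time.time() - start, "s")

# Second run: close to 0 s.
start = time.time()
session.run(None, {input_name: image})
print("second run:", time.time() - start, "s")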
Urgency
No response
Platform
Linux
OS Version
Ubuntu
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.2
Model File
No response
Is this a quantized model?
Unknown