[CUDA][Performance] Inference time varies greatly during session run #21966
Labels
ep:CUDA
issues related to the CUDA execution provider
performance
issues related to performance regressions
stale
issues that have not been addressed in a while; categorized by a bot
Describe the issue
During a session run with CUDAExecutionProvider, I noticed that the inference time varies greatly depending on whether or not I am using other applications on my computer.
For example, when I run my code with no other programs open, inference time is around 490 ms, but if I open, say, a chat app and start typing a message, the inference time jumps from 490 ms up to 1600 ms. If I stop using the other app and go back to my code, the inference time returns to normal values.
This is shown in the log below, where I recorded the inference time for each iteration, since I run my model inside a loop:
GPU_2070_inference_time.txt
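Since I can't share the actual project code, here is a minimal sketch of the kind of timing loop that produced these numbers. The model path, input/output names, and input shape are placeholders, not the real ones:

```cpp
// Minimal sketch (not the actual project code): times each Run() call of an
// ONNX Runtime session that uses CUDAExecutionProvider.
#include <onnxruntime_cxx_api.h>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "latency_check");
    Ort::SessionOptions opts;
    OrtCUDAProviderOptions cuda_opts{};              // device 0, default settings
    opts.AppendExecutionProvider_CUDA(cuda_opts);
    Ort::Session session(env, L"model.onnx", opts);  // placeholder path

    // Placeholder input: adjust the name and shape to the real model.
    std::vector<int64_t> shape{1, 3, 224, 224};
    std::vector<float> data(1 * 3 * 224 * 224, 0.5f);
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(
        mem, data.data(), data.size(), shape.data(), shape.size());

    const char* in_names[]  = {"input"};
    const char* out_names[] = {"output"};

    for (int i = 0; i < 100; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        auto out = session.Run(Ort::RunOptions{nullptr},
                               in_names, &input, 1, out_names, 1);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "iter " << i << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    }
    return 0;
}
```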
I also recorded the memory usage during runtime to check whether it increases when I use other apps, but as you can see below, there are no signs of a memory leak:
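As a side note, GPU memory can be sampled while the loop runs roughly as in the sketch below; this assumes the NVML library that ships with the NVIDIA driver and is not necessarily the exact tooling used for the recording:

```cpp
// Sketch: poll current GPU memory usage via NVML (link against nvml.lib).
// Intended to be called periodically while the inference loop runs.
#include <nvml.h>
#include <cstdio>

void print_gpu_memory() {
    if (nvmlInit() != NVML_SUCCESS) return;
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS) {
        std::printf("GPU memory used: %llu / %llu MiB\n",
                    mem.used / (1024ULL * 1024ULL),
                    mem.total / (1024ULL * 1024ULL));
    }
    nvmlShutdown();
}
```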
The test was done on an NVIDIA RTX 2070 Series GPU with 8 GB of GPU memory and 16 GB of RAM.
Has anyone encountered this type of behavior before? Is there a way to solve this?
I don't think it's normal for the inference time to be affected this much by a task as simple as typing a message.
To reproduce
Sadly, I'm not allowed to share the code or model as this is for a work project.
Urgency
Important
Platform
Windows
OS Version
10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15.0, but I also tested 1.13.1, 1.16.0, and other versions
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
11.8
Model File
No response
Is this a quantized model?
No