[Performance] Why is the first inference run so slow, even though one run was done during initialization? #19177
Comments
To be more precise, the first call to the method…
@xadupre Thank you for your reply. Only CUDA is used for prediction; you can see it in the code: sess_providers = ['CUDAExecutionProvider']
For the first iteration, your data is copied from CPU to GPU. Maybe that's not the case for the others. CUDA is usually faster after a few iterations (warm-up). Benchmarks on CUDA usually expose a warm-up parameter to take that effect into account.
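For reference, a minimal warm-up sketch (assuming a hypothetical model file `model.onnx` with a single float32 image input; not the poster's actual code):

```python
import time

import numpy as np
import onnxruntime as ort

# Hypothetical model path and input shape -- adjust for your model.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm-up: the first runs pay for host-to-device copies and CUDA
# memory-pool allocations, so exclude them from timing.
for _ in range(3):
    sess.run(None, {input_name: data})

start = time.perf_counter()
sess.run(None, {input_name: data})
print(f"steady-state latency: {time.perf_counter() - start:.4f}s")
```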
Are the image dimensions fixed, or are they bound to vary? Please see this issue if your image dimensions are dynamic and bound to vary, to optimize for that use-case. Also see this related documentation.

In general, the first inference run is expected to be a lot slower than the second run, because the first run is where most CUDA memory allocations happen (this is costly); the allocations are then cached in the memory pool for subsequent runs. Ensure that the warm-up run (first run) you do uses the same image shape as the subsequent runs if the image size is fixed. If you do this, the second run shouldn't be a lot slower than the third run (assuming image dimensions are fixed between the second and third calls).

If you have ensured all the above, how slow is the second inference call relative to the third call?

'If wait some seconds, then the next call will be slower.' - Are you saying that if there is a delay introduced between runs, then inference runs are slower? If so, please see this issue.
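A small sketch of the dynamic-shape effect described above (assuming the model's input has dynamic spatial dimensions; shapes here are hypothetical). The first run at each new shape can be slow again because fresh CUDA allocations may be needed:

```python
import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
name = sess.get_inputs()[0].name

# Repeat one shape, then switch: expect a latency spike at the new shape.
for shape in [(1, 3, 224, 224), (1, 3, 224, 224), (1, 3, 224, 224), (1, 3, 512, 512)]:
    x = np.random.rand(*shape).astype(np.float32)
    t0 = time.perf_counter()
    sess.run(None, {name: x})
    print(shape, f"{time.perf_counter() - t0:.4f}s")
```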
@xadupre
@hariharans29 Thank you for your reply.
Test results: using 300 images per iteration, run one iteration, wait 2 s, then run a second iteration:
@xadupre @hariharans29 Help!!! I tried the same test with TensorFlow, and it doesn't have this problem.
Is it possible to share the full script you use to run your benchmark?
So you run onnxruntime in a multithreaded environment. Based on your code, you have one instance of onnxruntime potentially called from multiple threads. onnxruntime is designed to use all the cores by default. Python should prevent multiple simultaneous calls into onnxruntime (the GIL), but maybe onnxruntime changes the way it manages memory if it detects multiple threads coming in. Maybe @hariharans29 knows more about that.
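For illustration, a hedged sketch of serializing access to one shared session from multiple server threads (class and method names are hypothetical, not the poster's code):

```python
import threading

import numpy as np
import onnxruntime as ort

class Predictor:
    """Wraps one shared InferenceSession behind a lock so HTTP handler
    threads run inference strictly one at a time."""

    def __init__(self, model_path: str):
        self._sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
        self._input = self._sess.get_inputs()[0].name
        self._lock = threading.Lock()

    def predict(self, image: np.ndarray):
        with self._lock:  # serialize calls from concurrent threads
            return self._sess.run(None, {self._input: image})
```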
@xadupre @hariharans29 Although the HTTP server uses a multi-threaded model, when onnxruntime is called, the images are predicted one by one.
Describe the issue
I built a class that creates the model and runs inference. During initialization, it creates random data and runs inference once.
But when other data is run afterwards, the first inference is still very slow. Why?
Also, if I wait a few seconds before running the next data, that run is slow again.
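The original script wasn't shared; below is a minimal sketch of the setup described, with hypothetical names and shapes:

```python
import numpy as np
import onnxruntime as ort

class Model:
    def __init__(self, model_path: str):
        self.sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
        self.input_name = self.sess.get_inputs()[0].name
        # Warm-up once with random data during initialization.
        dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
        self.sess.run(None, {self.input_name: dummy})

    def predict(self, image: np.ndarray):
        return self.sess.run(None, {self.input_name: image})
```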
To reproduce
Urgency
No response
Platform
Windows
OS Version
win10
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
onnxruntime-gpu==1.15
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8
Model File
No response
Is this a quantized model?
No