[Performance] Multiple instances of the same model are slower #22778
Labels
performance
issues related to performance regressions
stale
issues that have not been addressed in a while; categorized by a bot
Describe the issue
Running a model for N iterations in a single ONNX Runtime session is much faster than running the same model in two independent sessions for N/2 iterations each.
What's going on? I'm writing a real-time inference pipeline that runs two models (M1 and M2) at the same time. One of the models, M1, is quite small and has a fixed batch size of 1. The input for M2, on the other hand, relies on a sequence of M1 outputs bundled together. The batch size of 1 on the first model gets in the way of this requirement, as I have to run multiple inference requests instead of a single one, which increases the overall inference latency.
In this quest to reduce latency, I tried tinkering with multiple instances of the same model, since the model is quite small and might benefit from a multi-threading speed-up or something similar. But I found the opposite.
Running a single instance took 2.32 ms on average.
Running 2 instances took 9.29 ms on average.
What's going on? And more importantly, does having two sessions with two independent models (the M1 and M2 pipeline) affect the overall performance of ONNX Runtime?
To reproduce
Tinker with
n_models = 2
single_model = False
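The reproduction script itself is not included in this text, so the following is only a minimal sketch of the kind of comparison described above, not the original code. It assumes the sessions are called one after another in round-robin order from a single thread (the issue does not say whether the two instances ran sequentially or concurrently), and the helper names `make_ort_runner` and `average_latency_ms`, the `warmup` parameter, and the `"model.onnx"` path are all hypothetical. The meaning of `single_model` is not reproduced here, since the issue does not define it.

```python
import time
from typing import Callable, Sequence


def make_ort_runner(model_path: str, input_array) -> Callable[[], object]:
    """Return a zero-argument callable that runs one inference on its own session.

    Assumes onnxruntime is installed and the model has exactly one input.
    """
    import onnxruntime as ort  # lazy import: the timing helper below works without it

    session = ort.InferenceSession(
        model_path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    input_name = session.get_inputs()[0].name
    return lambda: session.run(None, {input_name: input_array})


def average_latency_ms(
    runners: Sequence[Callable[[], object]],
    total_iters: int,
    warmup: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Spread total_iters calls round-robin over runners; return mean ms per call."""
    if not runners or total_iters <= 0:
        raise ValueError("need at least one runner and a positive iteration count")
    # Untimed warm-up calls per runner, so one-time setup cost is excluded (an assumption).
    for run in runners:
        for _ in range(warmup):
            run()
    start = clock()
    for i in range(total_iters):
        runners[i % len(runners)]()  # with 2 runners, each gets total_iters / 2 calls
    elapsed_s = clock() - start
    return elapsed_s * 1000.0 / total_iters


# Example wiring (not executed here; "model.onnx" and `x` are placeholders):
# n_models = 2
# runners = [make_ort_runner("model.onnx", x) for _ in range(n_models)]
# print(average_latency_ms(runners, total_iters=1000))
```

Comparing the result for `n_models = 1` against `n_models = 2` with the same `total_iters` would mirror the single-session versus two-session measurement reported above, under the sequential-call assumption.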
Urgency
No response
Platform
Windows
OS Version
11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu==1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.7 + RTX3060 12GB
Model File
Replicated with a private model and MobilenetV2
Is this a quantized model?
No