accuracy reduced with multithreaded GPU prediction #15038
Comments
If your model has operators that need accumulation (like Softmax, LayerNormalization, etc.), the CUDA result can differ slightly when the partitioning changes. Even without multi-threading, you can observe this by running the same inputs multiple times and measuring the variance of the outputs. I suspect multithreaded GPU prediction causes the GPU to change its partitioning more frequently; for example, when some cores are in use by another thread, the GPU might schedule fewer cores for new requests, which can cause a minor change in accuracy. Another possible cause is convolution algorithm tuning, which can depend on free GPU memory. With multi-threading, each thread might have less GPU memory available because some of it is used by other threads, so a different convolution algorithm might be chosen, since some algorithms need more memory to run. Unlike PyTorch, ORT currently has no option to force deterministic algorithms, so nondeterministic algorithms might be selected.
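A minimal sketch of the repeated-run variance check described above, assuming the CUDA execution provider is available; the input name "input" and the 1x224x224x3 NHWC shape are assumptions about this particular MobileNetV2 export, not taken from the model itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Run the same input many times on the CUDA EP and look at the spread of one
// output score. Input name and shape below are assumptions about the model.
var options = new SessionOptions();
options.AppendExecutionProvider_CUDA(0);
using var session = new InferenceSession("model.onnx", options);

var tensor = new DenseTensor<float>(new[] { 1, 224, 224, 3 });
// ... fill tensor with one preprocessed image ...
var feeds = new[] { NamedOnnxValue.CreateFromTensor("input", tensor) };

var scores = new List<float>();
for (int i = 0; i < 100; i++)
{
    using var results = session.Run(feeds);
    scores.Add(results.First().AsEnumerable<float>().First());
}
Console.WriteLine($"min={scores.Min()} max={scores.Max()} spread={scores.Max() - scores.Min()}");
```

A nonzero spread here, even single-threaded, would point to the accumulation/partitioning effect rather than a threading bug.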
@mg-yolo-enterprises, could you try the following: create multiple inference sessions for the model and run inference on those sessions in parallel, with no parallelism within each session (images are inferred sequentially inside a session). If that reproduces the accuracy loss, the root cause is what I described above.
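A sketch of that experiment, assuming a list of image paths and a per-image helper; `imagePaths`, `PredictSequentially`, the session count, and the model path are hypothetical placeholders.

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;

// Several independent sessions, parallel across sessions, strictly sequential
// inference within each one. imagePaths and PredictSequentially are
// hypothetical placeholders for the surrounding pipeline.
const int sessionCount = 4;
var sessions = Enumerable.Range(0, sessionCount)
    .Select(_ =>
    {
        var opts = new SessionOptions();
        opts.AppendExecutionProvider_CUDA(0);
        return new InferenceSession("model.onnx", opts);
    })
    .ToArray();

// One chunk of image paths per session.
var chunks = imagePaths
    .Select((path, i) => (path, i))
    .GroupBy(x => x.i % sessionCount, x => x.path);

Parallel.ForEach(chunks, chunk =>
{
    var session = sessions[chunk.Key];
    foreach (var path in chunk)               // sequential within this session
        PredictSequentially(session, path);   // hypothetical per-image helper
});
```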
I ended up putting a simple Lock() around the call to session.Run, which eliminated the accuracy reduction I was seeing during parallel inference without sacrificing any performance, probably because the preprocessing steps are my main bottleneck.
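A sketch of that workaround: preprocessing stays parallel while Session.Run is serialized behind a lock. `session`, `imagePaths`, `Preprocess`, and `Postprocess` are hypothetical placeholders for the actual pipeline.

```csharp
using System.Threading.Tasks;

// Parallel preprocessing, serialized Session.Run.
var runLock = new object();

Parallel.ForEach(imagePaths, path =>
{
    var feeds = Preprocess(path);      // image load, resize, tensor build (parallel)
    lock (runLock)                     // only one thread inside Run at a time
    {
        using var results = session.Run(feeds);
        Postprocess(path, results);
    }
});
```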
Describe the issue
A dataset of 20k images was used to perform transfer learning on a MobileNetV2 TF image classifier using
https://github.com/tensorflow/hub/tree/master/tensorflow_hub/tools/make_image_classifier
...which was converted to ONNX format using
https://github.com/onnx/tensorflow-onnx
The resulting model is being consumed using code provided in
https://onnxruntime.ai/docs/get-started/with-csharp.html
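For reference, a minimal single-image inference sketch along the lines of the linked C# example; the input name "input", the 1x224x224x3 NHWC shape, and the preprocessing are assumptions about this export rather than details taken from the attached model.

```csharp
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Minimal single-image inference; input name and shape are assumptions.
var options = new SessionOptions();
options.AppendExecutionProvider_CUDA(0);
using var session = new InferenceSession("model.onnx", options);

var tensor = new DenseTensor<float>(new[] { 1, 224, 224, 3 });
// ... fill tensor with the resized, normalized image pixels ...

var inputs = new[] { NamedOnnxValue.CreateFromTensor("input", tensor) };
using var results = session.Run(inputs);
float[] scores = results.First().AsEnumerable<float>().ToArray();
```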
The model performs tremendously well, achieving 100% accurate predictions over the entire dataset. Individual prediction scores average 95% for all images.
To improve the inference speed, the following changes were made:
Based on the answer provided in #114, I assumed the InferenceSession was thread-safe and thus didn't worry about locking it or creating a session pool when calling Run from Parallel.ForEach (outlined in the sketch below).
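In outline, the change amounts to something like the following; `imagePaths`, `Preprocess`, and `sharedSession` are hypothetical placeholders for the actual pipeline, and the full code is attached below under "To reproduce".

```csharp
using System.Threading.Tasks;

// One shared InferenceSession, Parallel.ForEach over the dataset,
// no lock around Run. All names here are placeholders.
Parallel.ForEach(imagePaths, path =>
{
    var feeds = Preprocess(path);                  // load file, resize Bitmap, build Tensor
    using var results = sharedSession.Run(feeds);  // concurrent Run calls on one session
    // ... record the prediction ...
});
```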
The resulting speed increase is significant, as shown below:
Times listed above were measured on an Intel i7-12850HX with an NVIDIA RTX A2000 Laptop GPU, and include loading the image from file, the Bitmap resize operation, construction of the Tensor, and the call to Session.Run().
Surprisingly, it was discovered that only the first three scenarios listed above resulted in 100% accuracy of all model predictions. In the fourth case (GPU and Parallel.ForEach), a small, seemingly random number of predictions are false negatives or false positives. The count is generally in the single digits (out of 20,000 total predictions), but it is not consistent from one run to the next. The score given to an incorrect prediction is always around 50%, whereas the average score for correct predictions is in the mid-90s.
Is there any reason why running many predictions in parallel on the GPU could occasionally produce a wrong prediction?
To reproduce
Model:
model.onnx.zip
Code provided below:
Urgency
No response
Platform
Windows
OS Version
Windows 11 22H2
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
C#
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 11.6, cuDNN 8.5.0.96