[Performance] Why is the first inference so slow, even though it is run once during initialization? #19177

Open
nistarlwc opened this issue Jan 17, 2024 · 12 comments
Labels
ep:CUDA (issues related to the CUDA execution provider), platform:windows (issues related to the Windows platform)

Comments

@nistarlwc

nistarlwc commented Jan 17, 2024

Describe the issue

I built a class that creates the model and runs inference. In the constructor, I create a dummy (all-zero) input and run the session once as a warm-up.
But when I later run real data, the first inference is still very slow. Why?
Also, if I wait a few seconds before running the next input, that run is slow again.

To reproduce

import numpy as np
import onnxruntime as rt

# BATCH_SIZE, SEG_SIZE_H and SEG_SIZE_W are defined elsewhere in the project
class SemanticSegment(object):
    def __init__(self, model_path):
        sess_providers = ['CUDAExecutionProvider']
        sess_options = rt.SessionOptions()
        self.session = rt.InferenceSession(model_path, sess_options, providers=sess_providers)
        self.input_name = self.session.get_inputs()[0].name
        # Warm-up run with an all-zero image
        zero_image = np.zeros([BATCH_SIZE, SEG_SIZE_H, SEG_SIZE_W, 3], dtype=np.uint8)
        _ = self.session.run(None, {self.input_name: zero_image})

    def predict(self, image):
        # Add the batch dimension and run inference on a single image
        input_tensor = np.expand_dims(image, axis=0)
        prediction = self.session.run(None, {self.input_name: input_tensor})[0]
        return prediction
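
A minimal timing loop showing how the run times are measured (the model path, image size, and millisecond unit are assumptions):

import time

seg = SemanticSegment("model.onnx")  # placeholder model path
images = [np.random.randint(0, 255, (SEG_SIZE_H, SEG_SIZE_W, 3), dtype=np.uint8)
          for _ in range(300)]

for img in images:
    t0 = time.perf_counter()
    seg.predict(img)
    print("run time :", round((time.perf_counter() - t0) * 1000, 1))  # ms (assumed unit)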

Urgency

No response

Platform

Windows

OS Version

win10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.15

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

Model File

No response

Is this a quantized model?

No

@github-actions github-actions bot added the ep:CUDA and platform:windows labels on Jan 17, 2024
@xadupre
Member

xadupre commented Jan 17, 2024

To be more precise, is the first call to the predict method (so the second call to session.run) still much slower than the later calls? Are you using the GPU for inference? (The first call, in the constructor, uses the CPU.) That may be the cause: onnxruntime optimizes inference for CPU on the first call but has to start again on the second call (using CUDA).
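
A quick way to check which execution providers the session actually ended up with (a small sketch, using the session from the code above):

# If the CUDA EP fails to load, onnxruntime silently falls back to CPU,
# so it is worth printing what the session actually uses.
print(self.session.get_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] when CUDA is active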

@nistarlwc
Author

@xadupre Thank you for your reply.

The first call to session.run is slower than the second call, and the second call is slower than the third.
After the 4th or 5th call, the run time becomes stable.
If I wait a few seconds, the next call is slower again.

Only CUDA is used for prediction, as you can see in the code: sess_providers = ['CUDAExecutionProvider']


@xadupre
Member

xadupre commented Jan 18, 2024

For the first iteration, your data is copied from CPU to GPU; maybe that is not the case for the later ones. CUDA is usually faster after a few iterations (warm-up). Benchmarks on CUDA usually expose a warm-up parameter to take that effect into account.
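
If the copy is part of the cost, IOBinding lets you place the input on the GPU once and reuse the buffer (a minimal sketch; the model path, shape and dtype are placeholders):

import numpy as np
import onnxruntime as rt

session = rt.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Copy the input to the GPU once and bind it, so run() does not copy it again
x = np.zeros([1, 512, 512, 3], dtype=np.uint8)        # placeholder shape/dtype
x_gpu = rt.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

io = session.io_binding()
io.bind_ortvalue_input(input_name, x_gpu)
io.bind_output(output_name, "cuda")   # let ORT allocate the output on the GPU

session.run_with_iobinding(io)
result = io.copy_outputs_to_cpu()[0]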

@hariharans29
Member

Are the image dimensions fixed, or can they vary? If your image dimensions are dynamic, please see this issue for how to optimize for that use case. Also see this related documentation.

In general, the first inference run is expected to be a lot slower than the second run, because most CUDA memory allocations happen during the first run (which is costly) and are then cached in the memory pool for subsequent runs. Make sure that the warm-up run (the first run) uses the same image shape as the subsequent runs if the image size is fixed. If you do this, the second run shouldn't be a lot slower than the third run (assuming the image dimensions are fixed between the second and third calls). If you have ensured all of the above, how slow is the second inference call relative to the third?

'If I wait a few seconds, the next call is slower again.' - Are you saying that if a delay is introduced between runs, the subsequent runs are slower? If so, please see this issue.
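
For example, the warm-up in the constructor of the class above could use exactly the shape and dtype that predict() sends later (a batch of 1; note that if BATCH_SIZE is not 1, the original warm-up shape differs from the inference shape), and repeat a few times. A sketch under those assumptions:

# Inside __init__: warm up with the same shape/dtype as predict() (batch of 1),
# and repeat a few times so allocations are cached before real traffic arrives.
warmup_image = np.zeros([1, SEG_SIZE_H, SEG_SIZE_W, 3], dtype=np.uint8)
for _ in range(5):
    self.session.run(None, {self.input_name: warmup_image})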

@nistarlwc
Author

@xadupre
I also think the problem is warm-up, but how can it be solved?
In this project, the run time is very important.

@nistarlwc
Author

@hariharans29 Thank you for your reply.
The image dimensions are fixed.
I tried setting the GPU power options, like this:
[screenshot of GPU power settings]
But the run time did not improve.

@nistarlwc
Author

The test results:

Using 300 images per iteration.
First iteration:
run time : 110.5
run time : 79.6
run time : 54.3
run time : 6.9
run time : 6.9
run time : 6.9
......

After waiting 2 s, the second iteration:
run time : 57.8
run time : 56.8
run time : 58.8
run time : 6.9
run time : 6.9
run time : 6.9
......

@nistarlwc
Author

@xadupre @hariharans29 Help!!!
The problem is very serious for us.
Sometimes the first ~10 predictions are very slow.

I tried the same test with TensorFlow and did not see this problem.
I think the difference is static graph vs. dynamic graph.
But how can I use onnxruntime with a static graph?
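
Is it something like the enable_cuda_graph option of the CUDA execution provider, which captures and replays a static CUDA graph? As far as I understand it needs fixed shapes and inputs/outputs bound to fixed GPU buffers through IOBinding. A rough sketch of what I mean (model path, shapes and output type are placeholders):

import numpy as np
import onnxruntime as rt

providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
session = rt.InferenceSession("model.onnx", providers=providers)
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Bind fixed GPU buffers for input and output (their addresses must not change between runs)
x = np.zeros([1, 512, 512, 3], dtype=np.uint8)       # placeholder input shape/dtype
y = np.zeros([1, 512, 512, 2], dtype=np.float32)     # placeholder output shape/dtype
x_gpu = rt.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
y_gpu = rt.OrtValue.ortvalue_from_numpy(y, "cuda", 0)

io = session.io_binding()
io.bind_ortvalue_input(input_name, x_gpu)
io.bind_ortvalue_output(output_name, y_gpu)

# The first run allocates memory and captures the CUDA graph; later runs replay it
session.run_with_iobinding(io)

# For a new image, update the same GPU buffer in place, then replay
x_gpu.update_inplace(np.random.randint(0, 255, x.shape, dtype=np.uint8))
session.run_with_iobinding(io)
result = y_gpu.numpy()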

@xadupre
Member

xadupre commented Jan 29, 2024

Is it possible to share the full script you use to run your benchmark?

@nistarlwc
Author

@xadupre
Member

xadupre commented Jan 31, 2024

So you run onnxruntime in a multithreaded environment. Based on your code, you have one onnxruntime instance that is potentially called from multiple threads. onnxruntime is designed to use all the cores by default. Python should prevent multiple simultaneous calls into onnxruntime (the GIL), but maybe onnxruntime changes the way it manages memory if it detects multiple threads coming in. Maybe @hariharans29 knows more about that.
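
If several threads can reach the same session, one simple experiment is to serialize the calls explicitly (a minimal sketch extending the class above):

import threading
import numpy as np

class SemanticSegment(object):
    def __init__(self, model_path):
        # ... same session setup and warm-up as in the original class ...
        self._lock = threading.Lock()

    def predict(self, image):
        # Serialize session.run so only one thread runs inference at a time
        with self._lock:
            input_tensor = np.expand_dims(image, axis=0)
            return self.session.run(None, {self.input_name: input_tensor})[0]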

@nistarlwc
Author

nistarlwc commented Feb 1, 2024

@xadupre @hariharans29, although the HTTP server uses a multi-threading model, when onnxruntime is called the images are predicted one by one.
