[Performance] Why is the first inference so slow, even though it is run once during initialization? #19177

Open
nistarlwc opened this issue Jan 17, 2024 · 12 comments
Labels
ep:CUDA (issues related to the CUDA execution provider), platform:windows (issues related to the Windows platform)

Comments

@nistarlwc

nistarlwc commented Jan 17, 2024

Describe the issue

I built a class that creates the model and runs inference. In the constructor, I create a dummy (all-zero) input and run the session once as a warm-up.
But when I later run real data, the first inference is still very slow. Why?
Also, if I wait a few seconds before running the next input, that run is slow again.

To reproduce

import numpy as np
import onnxruntime as rt

# BATCH_SIZE, SEG_SIZE_H and SEG_SIZE_W are defined elsewhere in the project
class SemanticSegment(object):
    def __init__(self, model_path):
        sess_providers = ['CUDAExecutionProvider']
        sess_options = rt.SessionOptions()
        self.session = rt.InferenceSession(model_path, sess_options, providers=sess_providers)
        self.input_name = self.session.get_inputs()[0].name
        # Warm-up run with an all-zero image
        zero_image = np.zeros([BATCH_SIZE, SEG_SIZE_H, SEG_SIZE_W, 3], dtype=np.uint8)
        _ = self.session.run(None, {self.input_name: zero_image})

    def predict(self, image):
        # Add the batch dimension and run inference on a single image
        input_tensor = np.expand_dims(image, axis=0)
        prediction = self.session.run(None, {self.input_name: input_tensor})[0]
        return prediction
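
A minimal timing loop showing how the run times are measured (the model path, image size, and millisecond unit are assumptions):

import time

seg = SemanticSegment("model.onnx")  # placeholder model path
images = [np.random.randint(0, 255, (SEG_SIZE_H, SEG_SIZE_W, 3), dtype=np.uint8)
          for _ in range(300)]

for img in images:
    t0 = time.perf_counter()
    seg.predict(img)
    print("run time :", round((time.perf_counter() - t0) * 1000, 1))  # ms (assumed unit)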

Urgency

No response

Platform

Windows

OS Version

win10

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.15

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

Model File

No response

Is this a quantized model?

No

@github-actions github-actions bot added the ep:CUDA and platform:windows labels on Jan 17, 2024
@xadupre
Member

xadupre commented Jan 17, 2024

To be more precise, is the first call to the predict method (so the second call to session.run) still much slower than the later calls? Are you using the GPU for inference? (The first call, in the constructor, uses the CPU.) That may be the cause: onnxruntime optimizes inference for CPU on the first call but has to start again on the second call (using CUDA).
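
A quick way to check which execution providers the session actually ended up with (a small sketch, using the session from the code above):

# If the CUDA EP fails to load, onnxruntime silently falls back to CPU,
# so it is worth printing what the session actually uses.
print(self.session.get_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] when CUDA is active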

@nistarlwc
Author

@xadupre Thank you for your reply.

The first call to session.run is slower than the second call, and the second call is slower than the third.
After the 4th or 5th call, the run time becomes stable.
If I wait a few seconds, the next call is slower again.

Only CUDA is used for prediction, as you can see in the code: sess_providers = ['CUDAExecutionProvider']


@xadupre
Member

xadupre commented Jan 18, 2024

For the first iteration, your data is copied from CPU to GPU; maybe that is not the case for the later ones. CUDA is usually faster after a few iterations (warm-up). Benchmarks on CUDA usually expose a warm-up parameter to take that effect into account.
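
If the copy is part of the cost, IOBinding lets you place the input on the GPU once and reuse the buffer (a minimal sketch; the model path, shape and dtype are placeholders):

import numpy as np
import onnxruntime as rt

session = rt.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Copy the input to the GPU once and bind it, so run() does not copy it again
x = np.zeros([1, 512, 512, 3], dtype=np.uint8)        # placeholder shape/dtype
x_gpu = rt.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

io = session.io_binding()
io.bind_ortvalue_input(input_name, x_gpu)
io.bind_output(output_name, "cuda")   # let ORT allocate the output on the GPU

session.run_with_iobinding(io)
result = io.copy_outputs_to_cpu()[0]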

@hariharans29
Member

Are the image dimensions fixed, or can they vary? If your image dimensions are dynamic, please see this issue for how to optimize for that use case. Also see this related documentation.

In general, the first inference run is expected to be a lot slower than the second run, because most CUDA memory allocations happen during the first run (which is costly) and are then cached in the memory pool for subsequent runs. Make sure that the warm-up run (the first run) uses the same image shape as the subsequent runs if the image size is fixed. If you do this, the second run shouldn't be a lot slower than the third run (assuming the image dimensions are fixed between the second and third calls). If you have ensured all of the above, how slow is the second inference call relative to the third?

'If I wait a few seconds, the next call is slower again.' - Are you saying that if a delay is introduced between runs, the subsequent runs are slower? If so, please see this issue.
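
For example, the warm-up in the constructor of the class above could use exactly the shape and dtype that predict() sends later (a batch of 1; note that if BATCH_SIZE is not 1, the original warm-up shape differs from the inference shape), and repeat a few times. A sketch under those assumptions:

# Inside __init__: warm up with the same shape/dtype as predict() (batch of 1),
# and repeat a few times so allocations are cached before real traffic arrives.
warmup_image = np.zeros([1, SEG_SIZE_H, SEG_SIZE_W, 3], dtype=np.uint8)
for _ in range(5):
    self.session.run(None, {self.input_name: warmup_image})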

@nistarlwc
Author

@xadupre
I also think the problem is warm-up, but how can it be solved?
In this project, the run time is very important.

@nistarlwc
Author

@hariharans29 Thank you for your reply.
The image dimensions are fixed.
I tried setting the GPU power options, like this:
[screenshot of GPU power settings]
But the run time did not improve.

@nistarlwc
Author

The test results:

Using 300 images per iteration.
First iteration:
run time : 110.5
run time : 79.6
run time : 54.3
run time : 6.9
run time : 6.9
run time : 6.9
......

After waiting 2 s, the second iteration:
run time : 57.8
run time : 56.8
run time : 58.8
run time : 6.9
run time : 6.9
run time : 6.9
......

@nistarlwc
Author

@xadupre @hariharans29 Help!!!
The problem is very serious for us.
Sometimes the first ~10 predictions are very slow.

I tried the same test with TensorFlow and did not see this problem.
I think the difference is static graph vs. dynamic graph.
But how can I use onnxruntime with a static graph?
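
Is it something like the enable_cuda_graph option of the CUDA execution provider, which captures and replays a static CUDA graph? As far as I understand it needs fixed shapes and inputs/outputs bound to fixed GPU buffers through IOBinding. A rough sketch of what I mean (model path, shapes and output type are placeholders):

import numpy as np
import onnxruntime as rt

providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
session = rt.InferenceSession("model.onnx", providers=providers)
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Bind fixed GPU buffers for input and output (their addresses must not change between runs)
x = np.zeros([1, 512, 512, 3], dtype=np.uint8)       # placeholder input shape/dtype
y = np.zeros([1, 512, 512, 2], dtype=np.float32)     # placeholder output shape/dtype
x_gpu = rt.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
y_gpu = rt.OrtValue.ortvalue_from_numpy(y, "cuda", 0)

io = session.io_binding()
io.bind_ortvalue_input(input_name, x_gpu)
io.bind_ortvalue_output(output_name, y_gpu)

# The first run allocates memory and captures the CUDA graph; later runs replay it
session.run_with_iobinding(io)

# For a new image, update the same GPU buffer in place, then replay
x_gpu.update_inplace(np.random.randint(0, 255, x.shape, dtype=np.uint8))
session.run_with_iobinding(io)
result = y_gpu.numpy()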

@xadupre
Member

xadupre commented Jan 29, 2024

Is it possible to share the full script you use to run your benchmark?

@nistarlwc
Author

@xadupre
Member

xadupre commented Jan 31, 2024

So you run onnxruntime in a multithreaded environment. Based on your code, you have one onnxruntime instance that is potentially called from multiple threads. onnxruntime is designed to use all the cores by default. Python should prevent multiple simultaneous calls into onnxruntime (the GIL), but maybe onnxruntime changes the way it manages memory if it detects multiple threads coming in. Maybe @hariharans29 knows more about that.
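
If several threads can reach the same session, one simple experiment is to serialize the calls explicitly (a minimal sketch extending the class above):

import threading
import numpy as np

class SemanticSegment(object):
    def __init__(self, model_path):
        # ... same session setup and warm-up as in the original class ...
        self._lock = threading.Lock()

    def predict(self, image):
        # Serialize session.run so only one thread runs inference at a time
        with self._lock:
            input_tensor = np.expand_dims(image, axis=0)
            return self.session.run(None, {self.input_name: input_tensor})[0]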

@nistarlwc
Author

nistarlwc commented Feb 1, 2024

@xadupre @hariharans29, although the HTTP server uses a multi-threading model, when onnxruntime is called the images are predicted one by one.
