speedup #1

Open
ltm920716 opened this issue Aug 23, 2021 · 10 comments

@ltm920716 commented Aug 23, 2021

Hello,
I found that there is no speedup when using TensorRT (FP32, FP16) inference; is that right?

I also found that batch inference with the torch model gives no speedup. I do not know if I am doing something wrong.

@k9ele7en (Owner)

Hi @ltm920716, yes, TensorRT (RT) does speed up inference, because it optimizes the model for the specific GPU it is built (and then run) on.
By batch inference for the torch model, do you mean traditional .pth inference? If so, my repo does not improve that.
If you use RT inside Triton, we can further improve batch inference by enabling the dynamic batcher on the Triton server (https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#dynamic-batcher)...
FYI, TensorRT and Triton can bring further performance gains as we apply more optimizations on them.
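For reference, a minimal sketch of what enabling the dynamic batcher could look like in a Triton `config.pbtxt`; the model name and size values here are assumptions for illustration, not this repo's actual config:

```
# config.pbtxt (illustrative; the model name and sizes are assumptions)
name: "craft_trt"
platform: "tensorrt_plan"
max_batch_size: 8                    # the TRT engine must be built with a dynamic batch dimension
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]     # the server groups individual requests into these batch sizes
  max_queue_delay_microseconds: 100  # how long to wait for more requests before running a batch
}
```

With this, Triton can merge independent single-image requests into one batched forward pass on the server side, which is where the batching gain usually comes from in a deployed setup.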

@ltm920716 (Author)

@k9ele7en, I tested the traditional .pth model on a Tesla V100 (32 GB) and found that there is no speedup. So I think maybe the network is too large to gain from batch inference; I am quite confused.

@k9ele7en (Owner)

@ltm920716, did you test the .pth locally, or a .pt (TorchScript) model on the Triton server (placed in the Model Repository)?

@ltm920716 (Author) commented Aug 23, 2021

> @ltm920716, did you test the .pth locally, or a .pt (TorchScript) model on the Triton server (placed in the Model Repository)?

@k9ele7en I tested the original torch model craft_mlt_25k.pth on torch==1.7.0 with batching, and there is no speedup. Then I tested craft_mlt_25k.trt (torch -> onnx -> trt), and there is no speedup for FP32 either. I measured only the model inference time.

With TensorRT, FP32 gives no speedup, FP16 is faster, and FP16 is the same speed as INT8.
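As a side note on measuring only the model inference time on GPU: CUDA kernels launch asynchronously, so the forward pass should be preceded by warm-up runs and bracketed with torch.cuda.synchronize(), otherwise batch timings can be misleading. A minimal sketch, where the model object and input shapes are placeholders rather than the exact test code used here:

```python
import time
import torch

@torch.no_grad()
def time_forward(net, x, n_warmup=3, n_runs=5):
    """Time only the forward pass; synchronize so async CUDA kernels are included."""
    net.eval()
    for _ in range(n_warmup):          # warm-up (cuDNN autotune, allocator warm-up)
        net(x)
    torch.cuda.synchronize()
    times = []
    for _ in range(n_runs):
        start = time.time()
        net(x)
        torch.cuda.synchronize()       # wait for the GPU to finish before stopping the clock
        times.append(time.time() - start)
    return times

# Hypothetical usage with the input shapes mentioned in this thread:
# net = CRAFT().cuda()                       # assumes the CRAFT model class is available
# x1 = torch.randn(1, 3, 1280, 736).cuda()   # batch 1
# x8 = torch.randn(8, 3, 1280, 736).cuda()   # batch 8
# print(time_forward(net, x1), time_forward(net, x8))
```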

@ltm920716 (Author)

here are the results from the original test.py in the CRAFT repo:

torch.Size([1, 3, 1280, 736])
time up?: 0.05096149444580078
time up?: 0.04998612403869629
time up?: 0.05093955993652344
time up?: 0.05080008506774902
time up?: 0.05109596252441406

torch.Size([8, 3, 1280, 736])
time up?: 0.39319276809692383
time up?: 0.39789867401123047
time up?: 0.39710474014282227
time up?: 0.39400172233581543
time up?: 0.39536428451538086

so I am so confused.
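For what it is worth, dividing the numbers above per image: batch 1 runs at about 0.050 s per image, while batch 8 takes about 0.395 s / 8 ≈ 0.049 s per image, so per-image latency is essentially flat and total time grows linearly with batch size. One possible reading (an assumption, not a measured fact) is that a single 3x1280x736 input already saturates the GPU, so batching leaves little throughput headroom at this input size. A quick check of the arithmetic:

```python
# Averages of the timings reported above (seconds).
batch1_runs = [0.05096, 0.04999, 0.05094, 0.05080, 0.05110]
batch8_runs = [0.39319, 0.39790, 0.39710, 0.39400, 0.39536]

per_image_b1 = sum(batch1_runs) / len(batch1_runs) / 1   # ~0.051 s per image at batch 1
per_image_b8 = sum(batch8_runs) / len(batch8_runs) / 8   # ~0.049 s per image at batch 8
print(per_image_b1, per_image_b8)                        # roughly equal -> no throughput gain
```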

@k9ele7en (Owner)

> @k9ele7en I tested the original torch model craft_mlt_25k.pth on torch==1.7.0 with batching, and there is no speedup. Then I tested craft_mlt_25k.trt (torch -> onnx -> trt), and there is no speedup for FP32 either. I measured only the model inference time.
>
> With TensorRT, FP32 gives no speedup, FP16 is faster, and FP16 is the same speed as INT8.

@ltm920716, yes, the model needs to be large enough for the RT engine to make a difference in performance (time). I am not sure CRAFT is big enough; this repo is just an example for a large-scale solution. TensorRT + Triton are often combined in large deployed inference solutions, such as in medical or manufacturing businesses...

@k9ele7en (Owner)

> here are the results from the original test.py in the CRAFT repo:
> torch.Size([1, 3, 1280, 736]): ~0.050 s per run
> torch.Size([8, 3, 1280, 736]): ~0.395 s per run
>
> so I am so confused.

@ltm920716, I have not benchmarked or run batching experiments with RT yet. But currently in the config I fix the batch (dynamic input) size to 1 for all three values (min, max, opt); you can try again by setting a larger max batch size before exporting to ONNX and then building the RT engine...
(for example, `_C.INFERENCE.TRT_MIN_SHAPE = (1, 3, 256, 256)`)

Please refer to this: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#batching
Btw, why don't you pass x with batch size 3 to the .pth model and then compare with the RT engine...
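For illustration, a rough sketch (not this repo's actual export script) of how a wider dynamic batch range could be set when building the RT engine with the TensorRT Python API. The ONNX file name, input tensor name, and shapes are assumptions, and the ONNX model must have been exported with a dynamic batch axis:

```python
import tensorrt as trt

# Hypothetical min/opt/max shapes mirroring the TRT_*_SHAPE values in the config.
MIN_SHAPE = (1, 3, 1280, 736)
OPT_SHAPE = (4, 3, 1280, 736)
MAX_SHAPE = (8, 3, 1280, 736)

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("craft.onnx", "rb") as f:   # assumed file; export with dynamic_axes={"input": {0: "batch"}}
    parser.parse(f.read())

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# "input" is the assumed ONNX input name; check network.get_input(0).name in practice
profile.set_shape("input", MIN_SHAPE, OPT_SHAPE, MAX_SHAPE)
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)  # build_serialized_network() in newer TRT versions
```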

@ltm920716 (Author)

#1 (comment)
@k9ele7en thanks!
the test above is only on the original torch model, and I found that batch inference brings no improvement. I will compare with RT next. Thanks again!

@k9ele7en (Owner)

@ltm920716 no problem, it would be great if you share the results (both good and bad) of your experiments so that we can discuss them, and people can find useful information and avoid mistakes in the future...

@ltm920716 (Author)

> @ltm920716 no problem, it would be great if you share the results (both good and bad) of your experiments so that we can discuss them, and people can find useful information and avoid mistakes in the future...

OK, I will try.
