
[Feature Request] #20010

Open

inisis opened this issue Mar 21, 2024 · 5 comments
Labels
feature request request for unsupported feature or enhancement

Comments

@inisis
Contributor

inisis commented Mar 21, 2024

Describe the feature request

Hi, ONNX and ONNX Runtime are great. I have built a tool named onnxslim, which can help optimize ONNX models, especially large language models. Is there any chance that this tool could be used in the onnxruntime repo? Thanks.

Describe scenario use case

```
pip install onnxslim
onnxslim raw.onnx slim.onnx
```
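It can also be driven from Python for scripted use (a minimal sketch mirroring the CLI above; the `slim` entry point shown here is an assumption, so verify against the onnxslim README):

```python
# Assumed Python entry point mirroring the CLI usage above;
# verify the exact signature against the onnxslim README.
from onnxslim import slim

slim("raw.onnx", "slim.onnx")
```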

An example showing how onnxslim can slim Qwen-1.8B from Alibaba:

[screenshot: model summary before and after slimming]

@inisis added the feature request label on Mar 21, 2024
@inisis
Contributor Author

inisis commented Mar 21, 2024

@tianleiwu Can you please review it?

@tianleiwu
Contributor

tianleiwu commented Mar 21, 2024

@inisis, thanks for creating a helpful tool for the ONNX community.

ONNX Runtime applies graph optimizations during session creation. They are implemented in C++, as listed in https://github.com/microsoft/onnxruntime/blob/06fe4f31131a6873a295ba47ed60f4cb16584296/orttraining/orttraining/core/optimizer/graph_transformer_utils.cc
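These built-in optimizations are controlled through session options in the Python API. A minimal sketch (file names are placeholders), which also dumps the optimized graph to disk so it can be compared against onnxslim's output:

```python
import onnxruntime as ort

# Enable the full set of built-in graph optimizations and write the
# optimized graph to disk for inspection (file names are placeholders).
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "raw_ort_optimized.onnx"

sess = ort.InferenceSession("raw.onnx", so, providers=["CPUExecutionProvider"])
```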

Another is a Python-based offline optimization tool for transformers:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/optimizer.py
It fuses some subgraphs into custom operators like Attention/SkipLayerNorm/BiasGelu, etc. It can also convert an fp32 model to an fp16 mixed-precision model. It targets popular models like BERT/BART/T5/StableDiffusion. After fusion is done, only essential nodes are left in the ONNX graph, and I think onnxslim might not help much with those models.
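Offline usage looks roughly like this (a sketch; the head count and hidden size are placeholder values for a BERT-base-like model):

```python
from onnxruntime.transformers import optimizer

# Fuse transformer subgraphs into custom ops (Attention, SkipLayerNorm, ...).
# num_heads and hidden_size are placeholder values for a BERT-base-like model.
opt_model = optimizer.optimize_model(
    "bert.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)

# Optional: convert to fp16 mixed precision before saving.
opt_model.convert_float_to_float16()
opt_model.save_model_to_file("bert_opt_fp16.onnx")
```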

Related docs can be found here:
https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html
https://onnxruntime.ai/docs/performance/transformers-optimization.html

For LLMs, we have started using the torch dynamo exporter. The fusion patterns could be different from the TorchScript-based ONNX exporter.
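The dynamo export path looks like this in PyTorch 2.x (a toy sketch; the tiny module is only for illustration):

```python
import torch

class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

# torch.onnx.dynamo_export (PyTorch 2.x) produces graphs whose patterns can
# differ from the TorchScript-based torch.onnx.export path.
onnx_program = torch.onnx.dynamo_export(Tiny(), torch.randn(2, 8))
onnx_program.save("tiny_dynamo.onnx")
```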

I took a quick look at onnxslim. Some fusion patterns might be able to be added to the C++ optimizer. That would need porting some code from Python to C++.

@inisis
Contributor Author

inisis commented Mar 22, 2024

> @inisis, thanks for creating a helpful tool for the ONNX community. […]

So, the reason I wrote onnxslim is that I feel a C++-based project is hard for beginners, while onnxslim is pure Python; onnxslim also aims at more generalized optimization techniques rather than platform-targeted ones. I'm also working on applying onnxslim to torch-dynamo-exported ONNX. I hope to hear more details from you, thanks!

@inisis
Contributor Author

inisis commented Nov 7, 2024

Hi @tianleiwu, is there any doc about the torch dynamo fusion you mentioned before?

@tianleiwu
Contributor

> Hi @tianleiwu, is there any doc about the torch dynamo fusion you mentioned before?

Please take a look at the model builder for LLMs:
https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md

LLMs usually need some type of quantization and some special handling of the KV cache, which makes export and fusion difficult. Our current approach is to directly generate the optimized ONNX graph. It only supports popular LLM models, though.
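Invocation looks roughly like this (a sketch based on that README; the model id, output folder, precision, and execution provider are example values, so check the README for the current flags):

```
# Generate an optimized, int4-quantized ONNX graph for a supported LLM.
# Model id, output path, precision, and provider are example values.
python3 -m onnxruntime_genai.models.builder \
    -m microsoft/phi-2 \
    -o ./phi2-int4-cpu \
    -p int4 \
    -e cpu
```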
