
[Feature Request] #20010

Open

inisis opened this issue Mar 21, 2024 · 5 comments
Labels
feature request request for unsupported feature or enhancement

Comments

@inisis
Contributor

inisis commented Mar 21, 2024

Describe the feature request

Hi, ONNX and ONNX Runtime are great. I have built a tool named onnxslim, which can help optimize ONNX models, especially large language models. Is there any chance that this tool could be used in the onnxruntime repo? Thanks.

Describe scenario use case

```
pip install onnxslim
onnxslim raw.onnx slim.onnx
```
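It can also be driven from Python for scripted use (a minimal sketch mirroring the CLI above; the `slim` entry point shown here is an assumption, so verify against the onnxslim README):

```python
# Assumed Python entry point mirroring the CLI usage above;
# verify the exact signature against the onnxslim README.
from onnxslim import slim

slim("raw.onnx", "slim.onnx")
```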

An example showing how onnxslim can slim Qwen-1.8B from Alibaba:

[screenshot: model summary before and after slimming]

@inisis added the feature request label on Mar 21, 2024
@inisis
Contributor Author

inisis commented Mar 21, 2024

@tianleiwu Can you please review it?

@tianleiwu
Contributor

tianleiwu commented Mar 21, 2024

@inisis, thanks for creating a helpful tool for the ONNX community.

ONNX Runtime applies graph optimizations during session creation. They are implemented in C++, as listed in https://github.com/microsoft/onnxruntime/blob/06fe4f31131a6873a295ba47ed60f4cb16584296/orttraining/orttraining/core/optimizer/graph_transformer_utils.cc
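These built-in optimizations are controlled through session options in the Python API. A minimal sketch (file names are placeholders), which also dumps the optimized graph to disk so it can be compared against onnxslim's output:

```python
import onnxruntime as ort

# Enable the full set of built-in graph optimizations and write the
# optimized graph to disk for inspection (file names are placeholders).
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "raw_ort_optimized.onnx"

sess = ort.InferenceSession("raw.onnx", so, providers=["CPUExecutionProvider"])
```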

Another is a Python-based offline optimization tool for transformers:
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/optimizer.py
It fuses some subgraphs into custom operators like Attention/SkipLayerNorm/BiasGelu, etc. It can also convert an fp32 model to an fp16 mixed-precision model. It targets popular models like BERT/BART/T5/StableDiffusion. After fusion is done, only essential nodes are left in the ONNX graph, and I think onnxslim might not help much with those models.
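Offline usage looks roughly like this (a sketch; the head count and hidden size are placeholder values for a BERT-base-like model):

```python
from onnxruntime.transformers import optimizer

# Fuse transformer subgraphs into custom ops (Attention, SkipLayerNorm, ...).
# num_heads and hidden_size are placeholder values for a BERT-base-like model.
opt_model = optimizer.optimize_model(
    "bert.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)

# Optional: convert to fp16 mixed precision before saving.
opt_model.convert_float_to_float16()
opt_model.save_model_to_file("bert_opt_fp16.onnx")
```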

Related docs can be found here:
https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html
https://onnxruntime.ai/docs/performance/transformers-optimization.html

For LLMs, we have started using the torch dynamo exporter. The fusion patterns could be different from the TorchScript-based ONNX exporter.
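The dynamo export path looks like this in PyTorch 2.x (a toy sketch; the tiny module is only for illustration):

```python
import torch

class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

# torch.onnx.dynamo_export (PyTorch 2.x) produces graphs whose patterns can
# differ from the TorchScript-based torch.onnx.export path.
onnx_program = torch.onnx.dynamo_export(Tiny(), torch.randn(2, 8))
onnx_program.save("tiny_dynamo.onnx")
```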

I took a quick look at onnxslim. Some fusion patterns might be able to be added to the C++ optimizer. That would need porting some code from Python to C++.

@inisis
Contributor Author

inisis commented Mar 22, 2024

> @inisis, thanks for creating a helpful tool for the ONNX community. […]

So, the reason I wrote onnxslim is that I feel a C++-based project is hard for beginners, while onnxslim is pure Python; onnxslim also aims at more generalized optimization techniques rather than platform-targeted ones. I'm also working on applying onnxslim to torch-dynamo-exported ONNX. I hope to hear more details from you, thanks!

@inisis
Contributor Author

inisis commented Nov 7, 2024

Hi @tianleiwu, is there any doc about the torch dynamo fusion you mentioned before?

@tianleiwu
Contributor

> Hi @tianleiwu, is there any doc about the torch dynamo fusion you mentioned before?

Please take a look at the model builder for LLMs:
https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md

LLMs usually need some type of quantization and some special handling of the KV cache, which makes export and fusion difficult. Our current approach is to directly generate the optimized ONNX graph. It only supports popular LLM models, though.
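Invocation looks roughly like this (a sketch based on that README; the model id, output folder, precision, and execution provider are example values, so check the README for the current flags):

```
# Generate an optimized, int4-quantized ONNX graph for a supported LLM.
# Model id, output path, precision, and provider are example values.
python3 -m onnxruntime_genai.models.builder \
    -m microsoft/phi-2 \
    -o ./phi2-int4-cpu \
    -p int4 \
    -e cpu
```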
