These extensions help you run generative AI with ONNX Runtime (https://github.com/microsoft/onnxruntime-genai). They provide the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Developers can call a high-level generate() method, or run each iteration of the model in a loop, generating one token at a time and optionally updating generation parameters inside the loop. There is support for greedy/beam search and TopP/TopK sampling to generate token sequences, along with built-in logits processing such as repetition penalties. You can also easily add custom scoring.
At the application level, you can use Generative AI extensions for onnxruntime to build applications in C++, C#, or Python. At the model level, you can use it to merge fine-tuned models and handle the related quantization and deployment work.
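To make the generation loop above concrete, here is a minimal Python sketch using the onnxruntime-genai package. The model folder path and prompt are placeholders, and the exact calls (for example append_tokens versus setting input_ids on the generator parameters) vary slightly between preview releases, so treat this as a sketch to adapt rather than a definitive implementation.

```python
import onnxruntime_genai as og

# Load a model folder produced by Model Builder (placeholder path).
model = og.Model("./onnx-cpu")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configure search/sampling: TopK/TopP sampling with a repetition penalty.
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=256,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.1,
)

# Run the generation loop one token at a time, streaming the output.
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"))
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
```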
Generative AI extensions for onnxruntime support quantization conversion of Microsoft Phi, Google Gemma, Mistral, and Meta LLaMA.
The model builder greatly accelerates creating optimized and quantized ONNX models that run with the ONNX Runtime generate() API.
Through Model Builder, you can quantize the model to INT4, INT8, FP16, or FP32, and combine it with different hardware acceleration targets such as CPU, CUDA, DirectML, and mobile.
To use Model Builder, you need to install:

```bash
pip install torch transformers onnx onnxruntime
pip install --pre onnxruntime-genai
```
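A quick check that the packages installed correctly (this only reads the installed package versions, so it makes no assumptions about the library's own API):

```python
from importlib.metadata import version

# Confirm the core packages are installed and print their versions.
print("onnxruntime:", version("onnxruntime"))
print("onnxruntime-genai:", version("onnxruntime-genai"))
```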
After installation, you can run the Model Builder script from the terminal to perform model format conversion and quantization.
```bash
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
```
Understand the relevant parameters:

- model_name: the model on Hugging Face, such as microsoft/Phi-3.5-mini-instruct or microsoft/Phi-3.5-vision-instruct. It can also be a local path where you have stored the model.
- path_to_output_folder: the folder where the converted, quantized model is saved.
- precision: the precision to convert to, such as int4, fp16, or fp32.
- execution_provider: the hardware acceleration to target, such as cpu, cuda, or DirectML.
- cache_dir_to_save_hf_files: the local folder where the model downloaded from Hugging Face is cached.
Note:
- Although Generative AI extensions for onnxruntime is in preview, it has already been incorporated into Microsoft Olive, so you can also call the Model Builder functionality through Microsoft Olive.
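For illustration, a minimal sketch of driving Model Builder through Olive's Python workflow API might look like the following. The workflow config schema changes between Olive releases, so the HfModel input type, the ModelBuilder pass name, and the option names used here are assumptions to check against the Olive documentation for your installed version.

```python
from olive.workflows import run as olive_run

# Assumed workflow config: convert a Hugging Face model to an INT4 ONNX model
# via Olive's Model Builder pass (field names may differ by Olive version).
config = {
    "input_model": {
        "type": "HfModel",
        "model_path": "microsoft/Phi-3.5-mini-instruct",
    },
    "passes": {
        "builder": {
            "type": "ModelBuilder",
            "precision": "int4",
        },
    },
    "output_dir": "./onnx-cpu",
}

olive_run(config)
```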
Model Builder now supports ONNX model quantization for Phi-3.5-Instruct and Phi-3.5-Vision.
CPU-accelerated conversion to quantized INT4:

```bash
python3 -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -o ./onnx-cpu -p int4 -e cpu -c ./Phi-3.5-mini-instruct
```
CUDA-accelerated conversion to quantized INT4:

```bash
python3 -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -o ./onnx-cuda -p int4 -e cuda -c ./Phi-3.5-mini-instruct
```
Phi-3.5-vision-instruct-onnx-cpu-fp32
- Set up your environment in the terminal:

```bash
mkdir models
cd models
```
- Download microsoft/Phi-3.5-vision-instruct into the models folder: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
- Download these files into your Phi-3.5-vision-instruct folder:
  - https://huggingface.co/lokinfey/Phi-3.5-vision-instruct-onnx-cpu/resolve/main/onnx/config.json
  - https://huggingface.co/lokinfey/Phi-3.5-vision-instruct-onnx-cpu/blob/main/onnx/modeling_phi3_v.py
- Download this file into the models folder: https://huggingface.co/lokinfey/Phi-3.5-vision-instruct-onnx-cpu/blob/main/onnx/build.py
- Go to the terminal and convert the model to ONNX with FP32:

```bash
python build.py -i .\Your Phi-3.5-vision-instruct Path\ -o .\vision-cpu-fp32 -p f32 -e cpu
```
- Model Builder currently supports the conversion of Phi-3.5-Instruct and Phi-3.5-Vision, but not Phi-3.5-MoE.
- To use the quantized ONNX model, use the Generative AI extensions for onnxruntime SDK (see the sketch after this list).
- For more responsible AI, it is recommended to test the model's results thoroughly after the quantization conversion to confirm they are still effective.
- Quantizing to a CPU INT4 model lets you deploy to edge devices, which covers better application scenarios, so the Phi-3.5-Instruct examples here are built around INT4.
- Learn more about Generative AI extensions for onnxruntime: https://onnxruntime.ai/docs/genai/
- Generative AI extensions for onnxruntime GitHub repo: https://github.com/microsoft/onnxruntime-genai
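As referenced in the notes above, here is a minimal smoke-test sketch that loads the INT4 CPU model built earlier (the ./onnx-cpu folder is assumed) and runs a couple of placeholder prompts through the generation loop so you can spot-check the quantized model's answers. Exact method names may vary between onnxruntime-genai preview releases.

```python
import onnxruntime_genai as og

model = og.Model("./onnx-cpu")   # INT4 CPU model produced by Model Builder (assumed path)
tokenizer = og.Tokenizer(model)

# A few placeholder prompts for spot-checking output quality after quantization.
prompts = [
    "<|user|>\nWhat is the capital of France?<|end|>\n<|assistant|>\n",
    "<|user|>\nSummarize what ONNX Runtime is in one sentence.<|end|>\n<|assistant|>\n",
]

for prompt in prompts:
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=128)
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    # Collect the generated tokens and decode them into text.
    output_tokens = []
    while not generator.is_done():
        generator.generate_next_token()
        output_tokens.append(generator.get_next_tokens()[0])

    print(tokenizer.decode(output_tokens))
    print("-" * 40)
```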