These extensions help you run generative AI with ONNX Runtime (https://github.com/microsoft/onnxruntime-genai). They provide the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Developers can call a high-level generate() method, or run each iteration of the model in a loop, generating one token at a time and optionally updating generation parameters inside the loop. There is support for greedy/beam search and TopP/TopK sampling to generate token sequences, along with built-in logits processing such as repetition penalties. You can also easily add custom scoring.
At the application level, you can use Generative AI extensions for onnxruntime to build applications in C++, C#, or Python. At the model level, you can use it to merge fine-tuned models and handle the related quantization and deployment work.
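To make the generation loop above concrete, here is a minimal Python sketch using the onnxruntime-genai package. The model folder path and prompt are placeholders, and the exact calls (for example append_tokens versus setting input_ids on the generator parameters) vary slightly between preview releases, so treat this as a sketch to adapt rather than a definitive implementation.

```python
import onnxruntime_genai as og

# Load a model folder produced by Model Builder (placeholder path).
model = og.Model("./onnx-cpu")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configure search/sampling: TopK/TopP sampling with a repetition penalty.
params = og.GeneratorParams(model)
params.set_search_options(
    max_length=256,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7,
    repetition_penalty=1.1,
)

# Run the generation loop one token at a time, streaming the output.
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"))
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
```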
Generative AI extensions for onnxruntime support quantization conversion of Microsoft Phi, Google Gemma, Mistral, and Meta LLaMA.
The model builder greatly accelerates creating optimized and quantized ONNX models that run with the ONNX Runtime generate() API.
Through Model Builder, you can quantize the model to INT4, INT8, FP16, or FP32, and combine it with different hardware acceleration targets such as CPU, CUDA, DirectML, and mobile.
To use Model Builder, you need to install:

```bash
pip install torch transformers onnx onnxruntime
pip install --pre onnxruntime-genai
```
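A quick check that the packages installed correctly (this only reads the installed package versions, so it makes no assumptions about the library's own API):

```python
from importlib.metadata import version

# Confirm the core packages are installed and print their versions.
print("onnxruntime:", version("onnxruntime"))
print("onnxruntime-genai:", version("onnxruntime-genai"))
```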
After installation, you can run the Model Builder script from the terminal to perform model format conversion and quantization.
```bash
python3 -m onnxruntime_genai.models.builder -m model_name -o path_to_output_folder -p precision -e execution_provider -c cache_dir_to_save_hf_files
```
Understand the relevant parameters:

- model_name: the model on Hugging Face, such as microsoft/Phi-3.5-mini-instruct or microsoft/Phi-3.5-vision-instruct. It can also be a local path where you have stored the model.
- path_to_output_folder: the folder where the converted, quantized model is saved.
- precision: the precision to convert to, such as int4, fp16, or fp32.
- execution_provider: the hardware acceleration to target, such as cpu, cuda, or DirectML.
- cache_dir_to_save_hf_files: the local folder where the model downloaded from Hugging Face is cached.
Note:
- Although Generative AI extensions for onnxruntime is in preview, it has already been incorporated into Microsoft Olive, so you can also call the Model Builder functionality through Microsoft Olive.
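For illustration, a minimal sketch of driving Model Builder through Olive's Python workflow API might look like the following. The workflow config schema changes between Olive releases, so the HfModel input type, the ModelBuilder pass name, and the option names used here are assumptions to check against the Olive documentation for your installed version.

```python
from olive.workflows import run as olive_run

# Assumed workflow config: convert a Hugging Face model to an INT4 ONNX model
# via Olive's Model Builder pass (field names may differ by Olive version).
config = {
    "input_model": {
        "type": "HfModel",
        "model_path": "microsoft/Phi-3.5-mini-instruct",
    },
    "passes": {
        "builder": {
            "type": "ModelBuilder",
            "precision": "int4",
        },
    },
    "output_dir": "./onnx-cpu",
}

olive_run(config)
```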
Model Builder now supports ONNX model quantization for Phi-3.5-Instruct and Phi-3.5-Vision.
CPU-accelerated conversion to quantized INT4:

```bash
python3 -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -o ./onnx-cpu -p int4 -e cpu -c ./Phi-3.5-mini-instruct
```
CUDA-accelerated conversion to quantized INT4:

```bash
python3 -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -o ./onnx-cuda -p int4 -e cuda -c ./Phi-3.5-mini-instruct
```
Phi-3.5-vision-instruct-onnx-cpu-fp32
- Set up your environment in the terminal:

```bash
mkdir models
cd models
```
- Download microsoft/Phi-3.5-vision-instruct into the models folder: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
- Download these files into your Phi-3.5-vision-instruct folder:
  - https://huggingface.co/lokinfey/Phi-3.5-vision-instruct-onnx-cpu/resolve/main/onnx/config.json
  - https://huggingface.co/lokinfey/Phi-3.5-vision-instruct-onnx-cpu/blob/main/onnx/modeling_phi3_v.py
- Download this file into the models folder: https://huggingface.co/lokinfey/Phi-3.5-vision-instruct-onnx-cpu/blob/main/onnx/build.py
- Go to the terminal and convert the model to ONNX with FP32:

```bash
python build.py -i .\Your Phi-3.5-vision-instruct Path\ -o .\vision-cpu-fp32 -p f32 -e cpu
```
- Model Builder currently supports the conversion of Phi-3.5-Instruct and Phi-3.5-Vision, but not Phi-3.5-MoE.
- To use the quantized ONNX model, use the Generative AI extensions for onnxruntime SDK (see the sketch after this list).
- For more responsible AI, it is recommended to test the model's results thoroughly after the quantization conversion to confirm they are still effective.
- Quantizing to a CPU INT4 model lets you deploy to edge devices, which covers better application scenarios, so the Phi-3.5-Instruct examples here are built around INT4.
- Learn more about Generative AI extensions for onnxruntime: https://onnxruntime.ai/docs/genai/
- Generative AI extensions for onnxruntime GitHub repo: https://github.com/microsoft/onnxruntime-genai
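As referenced in the notes above, here is a minimal smoke-test sketch that loads the INT4 CPU model built earlier (the ./onnx-cpu folder is assumed) and runs a couple of placeholder prompts through the generation loop so you can spot-check the quantized model's answers. Exact method names may vary between onnxruntime-genai preview releases.

```python
import onnxruntime_genai as og

model = og.Model("./onnx-cpu")   # INT4 CPU model produced by Model Builder (assumed path)
tokenizer = og.Tokenizer(model)

# A few placeholder prompts for spot-checking output quality after quantization.
prompts = [
    "<|user|>\nWhat is the capital of France?<|end|>\n<|assistant|>\n",
    "<|user|>\nSummarize what ONNX Runtime is in one sentence.<|end|>\n<|assistant|>\n",
]

for prompt in prompts:
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=128)
    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    # Collect the generated tokens and decode them into text.
    output_tokens = []
    while not generator.is_done():
        generator.generate_next_token()
        output_tokens.append(generator.get_next_tokens()[0])

    print(tokenizer.decode(output_tokens))
    print("-" * 40)
```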