[Feature Request] phi-3-small-128k-onnx-cpu model #519
Comments
The Phi-3 small model contains the SparseAttention operator, which requires the kernel to be defined and implemented in ONNX Runtime. As of now, we only have the kernel implemented for CUDA.
Interesting... according to these two pages, Run Phi-3 language models with the ONNX Runtime generate() API and Run the Phi-3 vision model with the ONNX Runtime generate() API, they can run on CPU. Am I missing something? Just an FYI, I'm new to Python/CUDA/PyTorch/etc. and this ecosystem.
All Phi-3 family models except Phi-3 small can be run on CPU. The linked documentation doesn't mention the Phi-3 small model; maybe we should explicitly call that out in the docs.
Support for Phi-3 small on CPU has been added in this PR. You can follow the instructions in the PR to get your environment and files set up, and then use those changes to generate a Phi-3 small ONNX model for CPU (a sketch of the builder invocation is below).
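For reference, a minimal sketch of the model-builder invocation. The output and cache directories are placeholders, and the flags follow the onnxruntime-genai model builder's documented options; exact values depend on your setup.

```sh
# Sketch: build a Phi-3 small ONNX model for CPU with the model builder.
# Requires ONNX Runtime and ONNX Runtime GenAI built from source (see the PR).
# The output (-o) and cache (-c) directories are placeholders.
python3 -m onnxruntime_genai.models.builder \
    -m microsoft/Phi-3-small-128k-instruct \
    -o ./phi3-small-128k-cpu \
    -p int4 \
    -e cpu \
    -c ./hf_cache
```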
### Description

This PR adds support for building Phi-3 small ONNX models for CPU in the model builder.

### Motivation and Context

Previously, the `SparseAttention` operator was only supported on CUDA in ONNX Runtime. With the [recent support](microsoft/onnxruntime#21110) for `SparseAttention` on CPU, Phi-3 small ONNX models can now run on CPU. This PR also helps [this issue](#519).

To use these changes, both ONNX Runtime and ONNX Runtime GenAI need to be [built from source](https://onnxruntime.ai/docs/genai/howto/build-from-source.html#option-3-build-from-source).

Because the official PyTorch repo does not have a `tokenizer.json` file, the `tokenizer.json` file needed for Phi-3 small in ONNX Runtime GenAI can be downloaded from the Hugging Face repos. Please see [here](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json) for Phi-3 small 8K and [here](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json) for Phi-3 small 128K.
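Once the model is generated (and the `tokenizer.json` placed in the model folder as described above), running it follows the usual generate() API flow. A minimal sketch, assuming the onnxruntime-genai Python API and a placeholder model path; method names follow recent releases and may differ slightly in older ones:

```python
# Minimal sketch: run the generated Phi-3 small CPU model with the
# ONNX Runtime GenAI Python API. The model folder path is a placeholder.
import onnxruntime_genai as og

model = og.Model("./phi3-small-128k-cpu")   # folder produced by the model builder
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Stream tokens to stdout as they are generated.
while not generator.is_done():
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```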
The onnx-gpu model for phi-3-small-128k is great: a perfect balance of quality and speed. Is there a plan to support a CPU version?
Thanks!