[Feature Request] phi-3-small-128k-onnx-cpu model #519
Comments
The Phi-3 small model contains the SparseAttention operator, which requires the kernel to be defined and implemented in ONNX Runtime. As of now, we only have the kernel implemented for CUDA.
Interesting... according to these two pages, Run Phi-3 language models with the ONNX Runtime generate() API and Run the Phi-3 vision model with the ONNX Runtime generate() API, they can run on CPU. Am I missing something? Just an FYI, I'm new to Python/CUDA/PyTorch/etc. and this ecosystem.
All Phi-3 family models except Phi-3 small can be run on CPU. The linked documentation doesn't mention the Phi-3 small model; maybe we should explicitly call that out in the docs.
Support for Phi-3 small on CPU has been added in this PR. You can follow the instructions in the PR to get your environment and files set up, and then use those changes to generate a Phi-3 small ONNX model for CPU (a sketch of the builder invocation is below).
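For reference, a minimal sketch of the model-builder invocation. The output and cache directories are placeholders, and the flags follow the onnxruntime-genai model builder's documented options; exact values depend on your setup.

```sh
# Sketch: build a Phi-3 small ONNX model for CPU with the model builder.
# Requires ONNX Runtime and ONNX Runtime GenAI built from source (see the PR).
# The output (-o) and cache (-c) directories are placeholders.
python3 -m onnxruntime_genai.models.builder \
    -m microsoft/Phi-3-small-128k-instruct \
    -o ./phi3-small-128k-cpu \
    -p int4 \
    -e cpu \
    -c ./hf_cache
```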
### Description

This PR adds support for building Phi-3 small ONNX models for CPU in the model builder.

### Motivation and Context

Previously, the `SparseAttention` operator was only supported on CUDA in ONNX Runtime. With the [recent support](microsoft/onnxruntime#21110) for `SparseAttention` on CPU, Phi-3 small ONNX models can now run on CPU. This PR also helps [this issue](#519).

To use these changes, both ONNX Runtime and ONNX Runtime GenAI need to be [built from source](https://onnxruntime.ai/docs/genai/howto/build-from-source.html#option-3-build-from-source).

Because the official PyTorch repo does not have a `tokenizer.json` file, the `tokenizer.json` file needed for Phi-3 small in ONNX Runtime GenAI can be downloaded from the Hugging Face repos. Please see [here](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json) for Phi-3 small 8K and [here](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json) for Phi-3 small 128K.
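Once the model is generated (and the `tokenizer.json` placed in the model folder as described above), running it follows the usual generate() API flow. A minimal sketch, assuming the onnxruntime-genai Python API and a placeholder model path; method names follow recent releases and may differ slightly in older ones:

```python
# Minimal sketch: run the generated Phi-3 small CPU model with the
# ONNX Runtime GenAI Python API. The model folder path is a placeholder.
import onnxruntime_genai as og

model = og.Model("./phi3-small-128k-cpu")   # folder produced by the model builder
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

# Stream tokens to stdout as they are generated.
while not generator.is_done():
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```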
The onnx-gpu model for phi-3-small-128k is great: a perfect balance of quality and speed. Is there a plan to support a CPU version?
Thanks!