
[Feature Request] phi-3-small-128k-onnx-cpu model #519

Closed
Ben-Epstein opened this issue May 24, 2024 · 4 comments
Labels: enhancement (New feature or request)

@Ben-Epstein (Contributor)

The onnx-gpu model for phi-3-small-128k is great: a perfect balance of quality and speed. Is there a plan to support a CPU version?

Thanks!

@baijumeswani (Collaborator)

The Phi-3 small model contains the SparseAttention operator, which requires a kernel to be defined and implemented in ONNX Runtime. As of now, we only have the kernel implemented for CUDA.
We intend to add a CPU kernel in the near future. Once that is added, we will be able to support Phi-3 small on CPU as well.
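
For context, a quick way to check which execution providers your ONNX Runtime build exposes; if `CUDAExecutionProvider` is absent, a model containing SparseAttention cannot run at this point, since the operator only has a CUDA kernel so far:

```python
import onnxruntime as ort

# List the execution providers compiled into this ONNX Runtime build.
# SparseAttention currently requires "CUDAExecutionProvider" to be present.
print(ort.get_available_providers())
```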

@andliang commented Jun 7, 2024

> We intend to add a CPU kernel in the near future. Once that is added, we will be able to support Phi-3 small on CPU as well.

Interesting... according to these two pages, "Run Phi-3 language models with the ONNX Runtime generate() API" and "Run the Phi-3 vision model with the ONNX Runtime generate() API", they can run on CPU. Am I missing something?

Just an FYI, I'm new to Python/CUDA/PyTorch/etc. and this ecosystem.

@baijumeswani (Collaborator)

> Interesting... according to these two pages, "Run Phi-3 language models with the ONNX Runtime generate() API" and "Run the Phi-3 vision model with the ONNX Runtime generate() API", they can run on CPU. Am I missing something?

All phi3 family models except for phi3-small can be run on CPU. The linked documentation doesn't mention the phi3-small model. Maybe we should explicitly call it out in the doc.
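
For readers following along, a minimal sketch of running one of the CPU-capable Phi-3 variants with the generate() API, based on the tutorials linked above (the model folder path is a placeholder, and the API names follow the onnxruntime-genai releases of this period):

```python
import onnxruntime_genai as og

# Placeholder path: a folder holding a CPU ONNX export of a Phi-3 variant
# that supports CPU (e.g. Phi-3 mini); Phi-3 small will not work here yet.
model = og.Model("./phi-3-mini-128k-instruct-onnx-cpu")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("What is ONNX Runtime?")

# Generate tokens one at a time until the search is done.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```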

@kunal-vaishnavi (Contributor)

Support for Phi-3 small on CPU has been added in this PR. You can follow the instructions in the PR to set up your environment and files, and then use those changes to generate a Phi-3 small ONNX model for CPU.
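
Once both packages are built from source as described, generating the CPU model would look roughly like this (the output directory is a placeholder; the flags follow the model builder's documented `-m`/`-o`/`-p`/`-e` options):

```python
import subprocess
import sys

# Invoke the onnxruntime-genai model builder to export Phi-3 small for CPU.
# -m: source Hugging Face model, -o: output folder (placeholder),
# -p: precision, -e: target execution provider.
subprocess.run(
    [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", "microsoft/Phi-3-small-128k-instruct",
        "-o", "./phi3-small-128k-instruct-onnx-cpu",
        "-p", "int4",
        "-e", "cpu",
    ],
    check=True,
)
```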

kunal-vaishnavi added a commit that referenced this issue Jul 29, 2024
### Description

This PR adds support for building Phi-3 small ONNX models for CPU in the
model builder.

### Motivation and Context

Previously, the `SparseAttention` operator was only supported on CUDA in
ONNX Runtime. With the [recent
support](microsoft/onnxruntime#21110) for
`SparseAttention` on CPU, Phi-3 small ONNX models can now run on CPU.
This PR also helps [this
issue](#519).

To use these changes, both ONNX Runtime and ONNX Runtime GenAI need to
be [built from
source](https://onnxruntime.ai/docs/genai/howto/build-from-source.html#option-3-build-from-source).
Because the official PyTorch repo does not have a `tokenizer.json` file,
the `tokenizer.json` file needed for Phi-3 small in ONNX Runtime GenAI
can be downloaded from the Hugging Face repos. Please see
[here](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json)
for Phi-3 small 8K and
[here](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json)
for Phi-3 small 128K.
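
For convenience, fetching the `tokenizer.json` named above could look like this sketch (the repo and file paths are taken from the PR's links; `huggingface_hub` is assumed to be installed):

```python
from huggingface_hub import hf_hub_download

# Pull tokenizer.json from the CUDA repo noted in the PR; the same file is
# used for the CPU model. Swap in the -8k- repo for the 8K variant.
tokenizer_path = hf_hub_download(
    repo_id="microsoft/Phi-3-small-128k-instruct-onnx-cuda",
    filename="cuda-int4-rtn-block-32/tokenizer.json",
)
print(tokenizer_path)
```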