Add Phi-3 small on CPU in model builder #710
Merged
Description
This PR adds support for building Phi-3 small ONNX models for CPU in the model builder.
Motivation and Context
Previously, the SparseAttention operator was only supported on CUDA in ONNX Runtime. With the recent support for SparseAttention on CPU, Phi-3 small ONNX models can now run on CPU. This PR also helps this issue.

To use these changes, both ONNX Runtime and ONNX Runtime GenAI need to be built from source. Because the official PyTorch repo does not have a tokenizer.json file, the tokenizer.json file needed for Phi-3 small in ONNX Runtime GenAI can be downloaded from the Hugging Face repos. Please see here for Phi-3 small 8K and here for Phi-3 small 128K.
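As a rough illustration of how a CPU build might be started once both packages are built from source, the sketch below assembles a model builder command line. It assumes the builder is invoked as the `onnxruntime_genai.models.builder` module with its `-m`, `-o`, `-p`, `-e`, and `-c` options; the model name, output folder, and `int4` precision are example values rather than anything this PR prescribes, and `builder_command` is a hypothetical helper.

```python
import shlex
import sys


def builder_command(model_name, output_dir, precision="int4",
                    execution_provider="cpu", cache_dir=None):
    """Return the argv list for one model builder run (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", model_name,          # Hugging Face model name (example value below)
        "-o", output_dir,          # folder that receives the ONNX model and configs
        "-p", precision,           # assumed precision choice for a CPU build
        "-e", execution_provider,  # "cpu" is the target this PR enables
    ]
    if cache_dir is not None:
        cmd += ["-c", cache_dir]   # optional cache folder for downloaded files
    return cmd


if __name__ == "__main__":
    # The model name and output path are assumptions for illustration only.
    cmd = builder_command("microsoft/Phi-3-small-8k-instruct", "./phi3-small-8k-cpu")
    # Print the command instead of running it; pass cmd to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

After the build finishes, the downloaded tokenizer.json would be placed in the output folder alongside the generated files.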