
[CPU] SparseAttention op #21110

Merged
merged 26 commits into from
Jul 4, 2024
Conversation

tianleiwu
Contributor

@tianleiwu tianleiwu commented Jun 20, 2024

Description

Add a SparseAttention CPU implementation.

  • Refactor GQAAttentionBase
  • Add SparseAttention implementation
  • Add test cases

This is an unfused implementation; a flash attention version will be added later.
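For intuition, the unfused path described above can be sketched in NumPy. This is a simplified illustration, not the op's actual code: shapes, the block-sparse layout, and the helper names are all hypothetical, and the real `SparseAttention` contrib op takes a per-layout block mask plus rotary/position inputs that are omitted here. The sketch does show the two ingredients the PR combines: a block-level sparsity mask expanded to token level, and GQA-style sharing of K/V heads across query heads (the part factored out of `GQAAttentionBase`).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unfused_sparse_attention(q, k, v, block_mask, block_size):
    """Unfused block-sparse attention sketch (hypothetical shapes).

    q:          (num_q_heads, seq_len, head_dim)
    k, v:       (num_kv_heads, seq_len, head_dim)  -- GQA: fewer KV heads
    block_mask: (seq_len // block_size, seq_len // block_size) bool,
                True where a query block may attend to a key block.
    """
    num_q_heads, seq_len, head_dim = q.shape
    num_kv_heads = k.shape[0]
    group = num_q_heads // num_kv_heads          # query heads per KV head
    scale = 1.0 / np.sqrt(head_dim)

    # Expand block-level sparsity to a token-level mask, ANDed with causality.
    tok = np.arange(seq_len)
    causal = tok[:, None] >= tok[None, :]
    allowed = block_mask[tok[:, None] // block_size, tok[None, :] // block_size]
    mask = causal & allowed

    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                          # shared K/V head for this query head
        scores = (q[h] @ k[kv].T) * scale        # full QK^T -- the "unfused" part
        scores = np.where(mask, scores, -np.inf) # masked-out positions get zero weight
        out[h] = softmax(scores) @ v[kv]
    return out
```

With `block_mask` all-True this reduces to ordinary causal GQA attention; a fused (flash-attention-style) version would instead tile the computation and skip masked-out blocks entirely rather than materializing the full score matrix.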

Motivation and Context

@tianleiwu tianleiwu requested a review from a team as a code owner June 20, 2024 03:43
@tianleiwu tianleiwu marked this pull request as draft June 20, 2024 03:43
@github-advanced-security github-advanced-security bot left a comment

PREfast found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@tianleiwu tianleiwu marked this pull request as ready for review July 2, 2024 19:17
@tianleiwu tianleiwu requested a review from kunal-vaishnavi July 3, 2024 00:07
kunal-vaishnavi previously approved these changes Jul 3, 2024
wangyems previously approved these changes Jul 3, 2024
@tianleiwu tianleiwu merged commit 7d9b12a into main Jul 4, 2024
93 of 100 checks passed
@tianleiwu tianleiwu deleted the tlwu/cpu_sparse_attn branch July 4, 2024 04:51
kunal-vaishnavi added a commit to microsoft/onnxruntime-genai that referenced this pull request Jul 29, 2024
### Description

This PR adds support for building Phi-3 small ONNX models for CPU in the
model builder.

### Motivation and Context

Previously, the `SparseAttention` operator was only supported on CUDA in
ONNX Runtime. With the [recent
support](microsoft/onnxruntime#21110) for
`SparseAttention` on CPU, Phi-3 small ONNX models can now run on CPU.
This PR also helps [this
issue](#519).

To use these changes, both ONNX Runtime and ONNX Runtime GenAI need to
be [built from
source](https://onnxruntime.ai/docs/genai/howto/build-from-source.html#option-3-build-from-source).
Because the official PyTorch repo does not include a `tokenizer.json` file,
the one needed for Phi-3 small in ONNX Runtime GenAI
can be downloaded from the Hugging Face repos. Please see
[here](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json)
for Phi-3 small 8K and
[here](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json)
for Phi-3 small 128K.
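The `tokenizer.json` links above follow the standard Hugging Face direct-download URL scheme (replacing `/blob/` with `/resolve/`). A small sketch, with a hypothetical helper name, showing how to construct such a URL; the actual download step is left as a comment so the snippet stays offline-safe:

```python
def hf_resolve_url(repo_id: str, path: str, revision: str = "main") -> str:
    # Build the direct-download ("resolve") URL for a file in a Hugging Face repo.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{path}"

url = hf_resolve_url(
    "microsoft/Phi-3-small-8k-instruct-onnx-cuda",
    "cuda-int4-rtn-block-32/tokenizer.json",
)
# To actually fetch it:
#   import urllib.request
#   urllib.request.urlretrieve(url, "tokenizer.json")
```

For the 128K variant, substitute the `Phi-3-small-128k-instruct-onnx-cuda` repo id.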