
[CPU] SparseAttention op #21110

Merged
merged 26 commits into from
Jul 4, 2024
Conversation

tianleiwu
Contributor

@tianleiwu tianleiwu commented Jun 20, 2024

Description

Add a SparseAttention CPU implementation.

  • Refactor GQAAttentionBase
  • Add SparseAttention implementation
  • Add test cases

This is an unfused implementation; a flash attention version will be added later.
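For intuition, the unfused path described above can be sketched in NumPy. This is a simplified illustration, not the op's actual code: shapes, the block-sparse layout, and the helper names are all hypothetical, and the real `SparseAttention` contrib op takes a per-layout block mask plus rotary/position inputs that are omitted here. The sketch does show the two ingredients the PR combines: a block-level sparsity mask expanded to token level, and GQA-style sharing of K/V heads across query heads (the part factored out of `GQAAttentionBase`).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unfused_sparse_attention(q, k, v, block_mask, block_size):
    """Unfused block-sparse attention sketch (hypothetical shapes).

    q:          (num_q_heads, seq_len, head_dim)
    k, v:       (num_kv_heads, seq_len, head_dim)  -- GQA: fewer KV heads
    block_mask: (seq_len // block_size, seq_len // block_size) bool,
                True where a query block may attend to a key block.
    """
    num_q_heads, seq_len, head_dim = q.shape
    num_kv_heads = k.shape[0]
    group = num_q_heads // num_kv_heads          # query heads per KV head
    scale = 1.0 / np.sqrt(head_dim)

    # Expand block-level sparsity to a token-level mask, ANDed with causality.
    tok = np.arange(seq_len)
    causal = tok[:, None] >= tok[None, :]
    allowed = block_mask[tok[:, None] // block_size, tok[None, :] // block_size]
    mask = causal & allowed

    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                          # shared K/V head for this query head
        scores = (q[h] @ k[kv].T) * scale        # full QK^T -- the "unfused" part
        scores = np.where(mask, scores, -np.inf) # masked-out positions get zero weight
        out[h] = softmax(scores) @ v[kv]
    return out
```

With `block_mask` all-True this reduces to ordinary causal GQA attention; a fused (flash-attention-style) version would instead tile the computation and skip masked-out blocks entirely rather than materializing the full score matrix.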

Motivation and Context

@tianleiwu tianleiwu requested a review from a team as a code owner June 20, 2024 03:43
@tianleiwu tianleiwu marked this pull request as draft June 20, 2024 03:43
@github-advanced-security github-advanced-security bot left a comment

PREfast found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@tianleiwu tianleiwu marked this pull request as ready for review July 2, 2024 19:17
@tianleiwu tianleiwu requested a review from kunal-vaishnavi July 3, 2024 00:07
kunal-vaishnavi previously approved these changes Jul 3, 2024
wangyems previously approved these changes Jul 3, 2024
@tianleiwu tianleiwu merged commit 7d9b12a into main Jul 4, 2024
93 of 100 checks passed
@tianleiwu tianleiwu deleted the tlwu/cpu_sparse_attn branch July 4, 2024 04:51
kunal-vaishnavi added a commit to microsoft/onnxruntime-genai that referenced this pull request Jul 29, 2024
### Description

This PR adds support for building Phi-3 small ONNX models for CPU in the
model builder.

### Motivation and Context

Previously, the `SparseAttention` operator was only supported on CUDA in
ONNX Runtime. With the [recent
support](microsoft/onnxruntime#21110) for
`SparseAttention` on CPU, Phi-3 small ONNX models can now run on CPU.
This PR also helps [this
issue](#519).

To use these changes, both ONNX Runtime and ONNX Runtime GenAI need to
be [built from
source](https://onnxruntime.ai/docs/genai/howto/build-from-source.html#option-3-build-from-source).
Because the official PyTorch repo does not include a `tokenizer.json` file,
the one needed for Phi-3 small in ONNX Runtime GenAI
can be downloaded from the Hugging Face repos. Please see
[here](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json)
for Phi-3 small 8K and
[here](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda/blob/main/cuda-int4-rtn-block-32/tokenizer.json)
for Phi-3 small 128K.
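The `tokenizer.json` links above follow the standard Hugging Face direct-download URL scheme (replacing `/blob/` with `/resolve/`). A small sketch, with a hypothetical helper name, showing how to construct such a URL; the actual download step is left as a comment so the snippet stays offline-safe:

```python
def hf_resolve_url(repo_id: str, path: str, revision: str = "main") -> str:
    # Build the direct-download ("resolve") URL for a file in a Hugging Face repo.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{path}"

url = hf_resolve_url(
    "microsoft/Phi-3-small-8k-instruct-onnx-cuda",
    "cuda-int4-rtn-block-32/tokenizer.json",
)
# To actually fetch it:
#   import urllib.request
#   urllib.request.urlretrieve(url, "tokenizer.json")
```

For the 128K variant, substitute the `Phi-3-small-128k-instruct-onnx-cuda` repo id.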