[Performance] MultiHeadAttention CPU kernel slower than unfused #19924
Comments
@BowenBao Which version of PyTorch are you currently using?
PyTorch is 2.2.1+cpu. onnxscript and onnx are also the most recent versions. (Updated the PyTorch version; I made a mistake previously.)
The CPU for the repro was an Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz on an Azure Standard F64s v2 VM.
### Description
The cost computation in ComputeVxAttentionScore is wrong: it should be sequence_length * v_head_size * total_sequence_length instead of sequence_length * v_head_size * sequence_length. The PR also fine-tunes the cost computation. On my local box with an i9 CPU, performance matches the unfused version, but it is much faster on an Azure VM with 16 threads.
### Motivation and Context
Fixes #19924
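A small worked example of why the corrected cost term matters (the variable names follow the PR description; the values are illustrative, not from the ORT source). The V x attention_probs GEMM multiplies a [sequence_length, total_sequence_length] probability matrix by a [total_sequence_length, v_head_size] value matrix, so its multiply-add count scales with total_sequence_length; the old formula underestimates the cost badly during decoding with a KV cache, where sequence_length is 1 but total_sequence_length is large:

```python
# Illustrative arithmetic only: cost of the attention_probs x V GEMM per head.
sequence_length = 1          # new tokens in a decoding step
past_sequence_length = 511   # cached tokens
total_sequence_length = sequence_length + past_sequence_length
v_head_size = 64

old_cost = sequence_length * v_head_size * sequence_length        # 64 (too small)
new_cost = sequence_length * v_head_size * total_sequence_length  # 32768
print(old_cost, new_cost)
```

An underestimated cost can keep the thread pool from parallelizing the GEMM, which is consistent with the fix helping most on the 16-thread Azure VM.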
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
As the title says. Running the repro script below performs a lightweight benchmark and saves the ONNX model files to disk for further analysis.
To reproduce
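The original repro script is not included in this copy of the issue. The sketch below is a hypothetical stand-in showing the timing harness only: it exports a single PyTorch multi-head attention layer to ONNX and times CPU inference with onnxruntime. All shapes, filenames, and the model itself are assumptions; the original script reportedly also produced an unfused variant for comparison, which is not shown here.

```python
# Hypothetical repro sketch (not the original script from this issue).
import time

import torch
import onnxruntime as ort


class MHA(torch.nn.Module):
    def __init__(self, embed_dim=1024, num_heads=16):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


model = MHA().eval()
x = torch.randn(1, 512, 1024)

# Save the ONNX model to disk for further analysis.
torch.onnx.export(model, (x,), "mha.onnx", input_names=["x"], output_names=["y"])

sess = ort.InferenceSession("mha.onnx", providers=["CPUExecutionProvider"])
feed = {"x": x.numpy()}

# Warm up, then time repeated runs.
for _ in range(5):
    sess.run(None, feed)
start = time.perf_counter()
for _ in range(50):
    sess.run(None, feed)
print(f"avg latency: {(time.perf_counter() - start) / 50 * 1e3:.2f} ms")
```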
Urgency
Negatively affects LLM inference on CPU with ONNX Runtime.
Platform
Linux
OS Version
20.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
33578cc
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No