
[webgpu] Always use tile matmulnbits for block_size = 32 #23140

Merged: 1 commit merged into microsoft:main on Dec 20, 2024

Conversation

@qjia7 (Contributor) commented on Dec 18, 2024

Description

After the prefill-time optimization in #23102, always using the tiled matmulnbits shader with block_size = 32 appears to bring better performance even on discrete GPUs for the Phi3 model.

Phi3 improves from 32.82 tokens/sec to 42.64 tokens/sec in easy mode on my NVIDIA RTX 2000 GPU.
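As a rough illustration of what the change does, the kernel-selection decision can be sketched as below. All names here are hypothetical; the actual logic lives in the WebGPU EP's MatMulNBits C++ implementation, and the check being dropped is the vendor/GPU-type gate guschmue mentions removing later in this thread.

```python
# Hypothetical sketch of the dispatch change (illustrative names only,
# not the real onnxruntime API).
def use_tiled_matmulnbits(block_size: int, is_discrete_gpu: bool) -> bool:
    """Decide whether to dispatch the tiled MatMulNBits shader.

    Before this PR (roughly): the tiled path was gated so discrete GPUs
    could fall back to the non-tiled program.
    After this PR: the tiled path is used whenever block_size == 32,
    regardless of GPU type.
    """
    return block_size == 32
```

For reference, the quoted Phi3 numbers correspond to roughly a 1.3x generation speedup (42.64 / 32.82 ≈ 1.30).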

@qjia7 (Contributor, Author) commented on Dec 18, 2024

@guschmue @fs-eire Please check the other GPUs you have at hand to see the overall results. Thanks.

@guschmue (Contributor) commented on Dec 19, 2024

Yes, I can confirm. I took out the Intel-only check and saw gains on an A2000, a 3060, and an M4.
I can run a full benchmark later today.

@guschmue (Contributor):

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue (Contributor):

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@guschmue (Contributor):

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models


Azure Pipelines successfully started running 2 pipeline(s).

@guschmue (Contributor):

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 4 pipeline(s).

guschmue added the ep:WebGPU (ort-web webgpu provider) label on Dec 19, 2024

Azure Pipelines successfully started running 9 pipeline(s).

@guschmue (Contributor):

For all models/scenarios on the impacted GPUs:

token/sec speedup: avg ratio = 1.19, >10% speedup = 56.0%, >10% slowdown = 4.0%, inside 10% = 40.0%
prefill(500) speedup: sum avg ratio = 2.20, >10% speedup = 100.0%, >10% slowdown = 0.0%, inside 10% = 0.0%
prefill(1000) speedup: sum_long avg ratio = 2.11, >10% speedup = 100.0%, >10% slowdown = 0.0%, inside 10% = 0.0%
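The summary buckets above can be reproduced from per-case throughput ratios (new tokens/sec divided by old). The 10% thresholds come from the comment itself; the helper name and the sample data below are invented for illustration, not taken from the actual benchmark harness.

```python
# Sketch: classify per-case speedup ratios into the buckets reported above.
# A ratio > 1.10 counts as a >10% speedup, < 0.90 as a >10% slowdown,
# and anything in [0.90, 1.10] as "inside 10%".
def summarize_ratios(ratios):
    n = len(ratios)
    return {
        "avg_ratio": sum(ratios) / n,
        "speedup_gt_10pct": 100.0 * sum(r > 1.10 for r in ratios) / n,
        "slowdown_gt_10pct": 100.0 * sum(r < 0.90 for r in ratios) / n,
        "inside_10pct": 100.0 * sum(0.90 <= r <= 1.10 for r in ratios) / n,
    }

# Invented sample: two clear wins, one regression, one roughly neutral case.
print(summarize_ratios([1.30, 1.19, 0.85, 1.05]))
```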

@guschmue guschmue merged commit 7c782f6 into microsoft:main Dec 20, 2024
75 checks passed
guschmue pushed a commit that referenced this pull request Dec 20, 2024