[js/webgpu] optimize MatmulNBits #21747
Conversation
cc @guschmue FYI. Not ready for review yet; I need to make this PR more comprehensive to support outputNumber > 1.
@guschmue @fs-eire @satyajandhyala Please take a look, thanks.
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline
/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline
/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
@guschmue @fs-eire @satyajandhyala The code is ready for review, but I see that three CI bots have failed.
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.
Azure Pipelines successfully started running 1 pipeline(s).
Sure, this is going in the right direction; I see a 2x speedup on Xe.
/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.
Is it possible to add the improvement to the existing code rather than writing a new function? I understand refactoring is more demanding, but less code is easier to maintain.
Yes, my final target is to keep only one blockwise program once the newly added one supports all features (e.g., zeroPoint as input) and the extra limitations are removed.
@satyajandhyala The refactor is done. Only one version of MatmulNBits is provided now, and all limitations are removed.
Does the PR yield a 2x perf improvement on Intel GPUs while keeping perf the same or better on Nvidia?
> Does the PR yield a 2x perf improvement on Intel GPUs while keeping perf the same or better on Nvidia?
I think so. This PR is a general optimization; theoretically it can improve perf on all GPUs. On Nvidia, I see about 80 tokens/s for phi3 on an RTX 4090. At least, I didn't see any regression.
@satyajandhyala @guschmue Please help merge if there are no more issues. Thanks.
Description
This optimization yields a ~2x speedup for phi3 on an integrated Intel GPU.
The optimization mainly stores input A's data in local variables instead of reloading it from global memory every time it is multiplied with B's data.
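To illustrate the idea, here is a minimal WGSL sketch of the caching pattern. This is a hypothetical sketch, not the actual MatmulNBits shader: the buffer names, constants, and data layout are assumptions, and the 4-bit dequantization of B (with scales/zero points) that the real kernel performs is elided by pretending B is already dequantized.

```wgsl
// Hypothetical sketch of the A-caching pattern; names and layout are illustrative.
@group(0) @binding(0) var<storage, read> a : array<vec4<f32>>;
@group(0) @binding(1) var<storage, read> b_dequant : array<vec4<f32>>;
@group(0) @binding(2) var<storage, read_write> output : array<f32>;

const K4 : u32 = 64u;           // K / 4 (vec4 elements per row), assumed fixed at codegen time
const OUTPUT_NUMBER : u32 = 4u; // columns of B handled per invocation, assumed
const N : u32 = 256u;           // total output columns, assumed

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  // Load this invocation's row of A from global memory exactly once
  // into function-scope storage (registers / local memory).
  var a_local : array<vec4<f32>, K4>;
  for (var k = 0u; k < K4; k = k + 1u) {
    a_local[k] = a[gid.y * K4 + k];
  }

  // Reuse the cached A values for every output column instead of
  // re-reading them from global memory in each inner loop.
  for (var n = 0u; n < OUTPUT_NUMBER; n = n + 1u) {
    let col = gid.x * OUTPUT_NUMBER + n;
    var sum = 0.0;
    for (var k = 0u; k < K4; k = k + 1u) {
      sum += dot(a_local[k], b_dequant[col * K4 + k]);
    }
    output[gid.y * N + col] = sum;
  }
}
```

With this structure, each element of A is read from global memory once per invocation rather than once per output column, so the saved memory traffic grows with outputNumber.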
Motivation and Context