
[js/webgpu] optimize MatmulNBits #21747

Merged

merged 11 commits into microsoft:main on Aug 23, 2024

Conversation
Conversation

qjia7
Contributor

@qjia7 qjia7 commented Aug 15, 2024

Description

This optimization gives a ~2x speedup for phi3 on an integrated Intel GPU.

The optimization mainly stores input A's data in local variables instead of loading it from global memory each time it is multiplied with B's data.
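To illustrate the idea, here is a minimal CPU-side sketch in TypeScript of what the WGSL shader does conceptually: B is 4-bit block-quantized with per-block scales and dequantized on the fly, and the row of A is copied into a local array once so the inner loops reuse it instead of re-reading global memory. The function name, shapes, and the symmetric zero point of 8 are illustrative assumptions, not the actual shader code in this PR.

```typescript
// Hypothetical CPU sketch of the MatmulNBits optimization (not the real shader).
// a:       [m, k] activations
// bQuant:  [n, k/2] packed 4-bit weights, two values per byte (k assumed even)
// scales:  [n, k/blockSize] per-block dequantization scales
function matmulNBits(
  a: Float32Array,
  bQuant: Uint8Array,
  scales: Float32Array,
  m: number, n: number, k: number, blockSize: number,
): Float32Array {
  const out = new Float32Array(m * n);
  const aLocal = new Float32Array(k); // local cache standing in for shader-local variables
  for (let row = 0; row < m; row++) {
    // Load A's row once; all n columns of B below reuse this local copy,
    // which is the key change this PR makes in the WGSL shader.
    aLocal.set(a.subarray(row * k, row * k + k));
    for (let col = 0; col < n; col++) {
      let sum = 0;
      for (let i = 0; i < k; i++) {
        const byte = bQuant[col * (k >> 1) + (i >> 1)];
        const q = (i & 1) === 0 ? byte & 0x0f : byte >> 4; // unpack one 4-bit value
        const scale = scales[col * (k / blockSize) + Math.floor(i / blockSize)];
        sum += aLocal[i] * (q - 8) * scale; // symmetric dequant, zero point 8 assumed
      }
      out[row * n + col] = sum;
    }
  }
  return out;
}
```

In the actual shader the cached values live in registers or workgroup-local storage rather than a heap array, but the data-reuse pattern is the same.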

Motivation and Context

@qjia7
Contributor Author

qjia7 commented Aug 15, 2024

cc @guschmue FYI. This is not ready for review yet, since I need to make this PR more comprehensive to support outputNumber > 1.

@qjia7 qjia7 changed the title from "[WIP] Opt matmulnbits" to "Opt matmulnbits" on Aug 16, 2024
@qjia7 qjia7 marked this pull request as ready for review August 16, 2024 06:17
@qjia7
Contributor Author

qjia7 commented Aug 16, 2024

@guschmue @fs-eire @satyajandhyala Please take a look, thanks.

@fs-eire
Contributor

fs-eire commented Aug 16, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

@fs-eire
Contributor

fs-eire commented Aug 16, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@fs-eire
Contributor

fs-eire commented Aug 16, 2024

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

(2 similar comments)

@fs-eire fs-eire changed the title from "Opt matmulnbits" to "[js/webgpu] optimize MatmulNBits" on Aug 16, 2024
@fs-eire
Contributor

fs-eire commented Aug 18, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

@fs-eire
Contributor

fs-eire commented Aug 18, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline

@fs-eire
Contributor

fs-eire commented Aug 18, 2024

/azp run Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

(2 similar comments)

@qjia7
Contributor Author

qjia7 commented Aug 19, 2024

@guschmue @fs-eire @satyajandhyala The code is ready for review. However, three checks failed in this PR and also in another PR (#19388): ONNX Runtime React Native CI Pipeline, ONNX Runtime React Native CI Pipeline (React Native CI ReactNative_CI), and ONNX Runtime Web CI Pipeline (Test_web_MultiBrowsers build_onnxruntime_web_windows). I think the errors are not related to my changes. Please let me know if you know the possible reasons. Thanks.

@guschmue
Contributor

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@guschmue
Contributor

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models


Azure Pipelines could not run because the pipeline triggers exclude this branch/path.


Azure Pipelines successfully started running 1 pipeline(s).

@guschmue
Contributor

Sure, this is going in the right direction. I see 2x on Xe.

@guschmue
Contributor

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@satyajandhyala satyajandhyala added the ep:WebGPU ort-web webgpu provider label Aug 20, 2024
@satyajandhyala
Contributor

Is it possible to add the improvement to the existing code rather than writing a new function? I understand that refactoring is more demanding; however, less code is easier to maintain.

@qjia7
Contributor Author

qjia7 commented Aug 20, 2024

Is it possible to add the improvement to the existing code rather than writing a new function? I understand that refactoring is more demanding; however, less code is easier to maintain.

Yes, my final target is to keep only one blockwise program once the newly added one supports all features (such as zeroPoint as input) and the extra limitations (such as nBlocksPerCol < maxComputeWorkgroupSizes[0]) are removed. To speed up progress, I used a new program instead of modifying the current code. Locally, I am also experimenting with other optimizations to see the results. If everything goes well, I will do the refactoring or submit a new optimization in the following days. Thanks.

@qjia7
Contributor Author

qjia7 commented Aug 22, 2024

@satyajandhyala The refactor is done. Only one version of MatmulNBits is now provided, and all limitations are removed.

@satyajandhyala
Contributor

Does the PR yield a 2x perf improvement on Intel GPUs while keeping perf at least the same or better on Nvidia?

Contributor Author

@qjia7 qjia7 left a comment


Does the PR yield a 2x perf improvement on Intel GPUs while keeping perf at least the same or better on Nvidia?

I think so. This PR is a general optimization; theoretically, it can bring perf improvements on all GPUs. On Nvidia, I see about 80 tokens for phi3 on an NV RTX 4090, and in any case I didn't see any regression.

@satyajandhyala @guschmue Please help merge if no more issues. Thanks.

@qjia7 qjia7 requested a review from satyajandhyala August 23, 2024 05:12
@guschmue
Contributor

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).


Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@guschmue
Contributor

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models


Azure Pipelines successfully started running 1 pipeline(s).

@guschmue
Contributor

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@guschmue guschmue merged commit 87165b9 into microsoft:main Aug 23, 2024
46 checks passed
Labels
ep:WebGPU ort-web webgpu provider