
webnn: Support block-wise quantization for DirectML backend #49083

Merged
merged 1 commit into master from chromium-export-cl-5964816 on Nov 9, 2024

Conversation

@chromium-wpt-export-bot (Collaborator) commented Nov 9, 2024


Block-wise quantization divides input tensors into smaller blocks that
are quantized independently, resulting in faster optimization and
high-precision quantization [1]. It is used in popular language models,
such as the Phi-3 mini int4 quantized model [2]. A related WG issue [3]
has been opened for discussion.
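
As an illustration of the idea described above, here is a minimal NumPy
sketch of block-wise affine quantization. This is illustrative only, not
the CL's implementation; the block size, the uint4 range, and the
function names are assumptions:

```python
# Illustrative sketch of block-wise quantization (not the CL's code):
# each block along the last axis gets its own scale and zero point.
import numpy as np

def blockwise_quantize(x, block_size=32, qmin=0, qmax=15):
    """Quantize a 2-D float tensor block-wise along its last axis."""
    rows, cols = x.shape
    assert cols % block_size == 0, "block_size must evenly divide the axis"
    blocks = x.reshape(rows, cols // block_size, block_size)

    # One scale/zero point per block, chosen so each block's own
    # min/max maps onto the [qmin, qmax] integer range.
    lo = blocks.min(axis=-1, keepdims=True)
    hi = blocks.max(axis=-1, keepdims=True)
    scale = np.maximum((hi - lo) / (qmax - qmin), 1e-8)
    zero_point = np.round(qmin - lo / scale)

    q = np.clip(np.round(blocks / scale + zero_point), qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def blockwise_dequantize(q, scale, zero_point, shape):
    # Inverse mapping: (q - zero_point) * scale, then restore the layout.
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)

x = np.random.randn(4, 64).astype(np.float32)
q, s, z = blockwise_quantize(x)
x_hat = blockwise_dequantize(q, s, z, x.shape)
print("max abs error:", np.abs(x - x_hat).max())
```

Because every block carries its own scale and zero point, an outlier in
one block does not degrade the precision of the others.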

First, this CL validates the scale and zero-point tensors for block-wise
quantization. It also implements block-wise quantization in the DirectML
backend using DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, which
are available at feature level (FL) 6.3 and higher.
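
The shape validation mentioned above can be pictured with a small
sketch. Assuming the rule discussed in the WG issue [3], that each input
dimension must be an exact multiple of the corresponding scale dimension
and that the zero-point shape must match the scale shape, a validator
might look like this (hypothetical names, not the CL's code):

```python
# Hedged sketch of block-wise shape validation (illustrative names and
# rules): every input dimension must be evenly divisible by the matching
# scale dimension, so each scale element covers one block of the input.
def validate_blockwise_shapes(input_shape, scale_shape, zero_point_shape):
    if len(scale_shape) != len(input_shape):
        raise ValueError("scale rank must match input rank")
    if zero_point_shape != scale_shape:
        raise ValueError("zero-point shape must match scale shape")
    block_sizes = []
    for dim, (in_d, sc_d) in enumerate(zip(input_shape, scale_shape)):
        if sc_d == 0 or in_d % sc_d != 0:
            raise ValueError(
                f"input dim {dim} ({in_d}) is not divisible by "
                f"scale dim ({sc_d})")
        block_sizes.append(in_d // sc_d)
    return block_sizes  # per-dimension block sizes

# e.g. a [3072, 3072] weight with one scale per 32-element block of the
# last axis has scale shape [3072, 96] -> block sizes [1, 32].
print(validate_blockwise_shapes([3072, 3072], [3072, 96], [3072, 96]))
```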

More validation and conformance tests are added to verify the
implementation.

[1]: https://arxiv.org/abs/2110.02861
[2]: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
[3]: webmachinelearning/webnn#779

Bug: 40206287
Change-Id: I977b0be57deebd7afcae216edc3ddc3818b8c09f
Cq-Include-Trybots: luci.chromium.try:mac14.arm64-blink-rel, mac14-blink-rel, mac15.arm64-blink-rel, mac15-blink-rel, linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5964816
Reviewed-by: Rafael Cintron <[email protected]>
Reviewed-by: ningxin hu <[email protected]>
Commit-Queue: ningxin hu <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1380767}
@wpt-pr-bot (Collaborator) left a comment


The review process for this patch is being conducted in the Chromium project.

@chromium-wpt-export-bot merged commit 8686b7a into master on Nov 9, 2024
20 checks passed
@chromium-wpt-export-bot deleted the chromium-export-cl-5964816 branch on November 9, 2024 at 02:58