WebNN should support int8 quantized models #128
Supporting int8 quantized models is essential for mobile scenarios and many NPU architectures. TensorFlow (Lite) and ONNX, for instance, have int8 quantization support built in, and WebNN should too. Related: #93
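As a hedged, purely illustrative sketch of what that support implies at the API surface (the descriptor shape and method signature below are assumptions, not spec text), quantized weights from a TFLite or ONNX model would become int8 graph constants carried alongside their quantization parameters:

```ts
// Stand-in types so the sketch is self-contained; the real WebNN IDL differs.
type MLOperand = unknown;
interface MLGraphBuilder {
  constant(desc: { dataType: string; dimensions: number[] }, data: ArrayBufferView): MLOperand;
}

// Hypothetical helper: register an int8 weight tensor plus its scale.
function loadQuantizedWeight(builder: MLGraphBuilder): { weight: MLOperand; scale: MLOperand } {
  const weightData = new Int8Array(64 * 64); // quantized weights from a TFLite/ONNX model
  const weight = builder.constant({ dataType: 'int8', dimensions: [64, 64] }, weightData);
  // The scale (and a zero point) travel with the tensor so backends can dequantize.
  const scale = builder.constant({ dataType: 'float32', dimensions: [1] }, new Float32Array([0.02]));
  return { weight, scale };
}
```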
Comments
@wchao1115 @huningxin do you think we should label this as "cr" for #240 purposes?
I think this is an important one and support labeling it as "cr".
@wchao1115 this issue was on the agenda today, but we had to defer due to timing. Let us know your thoughts. I'm planning to bring this up for discussion at our next meeting.
Per discussion at https://www.w3.org/2022/03/24-webmachinelearning-minutes.html#t06 we consider this to be in scope for CR.
We've discussed this feature in our recent meetings. I will label this issue as "v2" because inclusion in the initial CR requires implementation experience. There's a mechanism for us to publish a Candidate Recommendation Draft subsequent to the initial CR, which would give us adequate time to properly define, develop, and test this feature. Furthermore, we should soon start discussing the WebNN "v2" plan as we look to extend our current charter, and this could be one concrete feature to highlight. We can continue discussing this feature on our bi-weekly calls as new information arrives and revise our position as appropriate.
An XNNPACK Subgraph uses Values to represent the tensor data produced and consumed by Nodes. This CL implements the methods that help define XNNPACK Values for MLOperands: external Values for the graph's input and output operands, static Values for constant operands, and internal Values for intermediate operands. These methods are used by the MLGraphXnnpack::CreateXnnSubgraphAndRuntime() method, which visits the graph's operators in topological order and defines XNNPACK Values for the input and output operands of each operator. This CL initially supports defining XNNPACK Values for the float32 and float16 MLOperandTypes. Support for quantized integer types will be implemented as a WebNN V2 feature [1]. This CL also implements DefineXnnpackValuesTest, which covers the definitions of the different types of XNNPACK Values in various WebNN graph topologies.

[1]: webmachinelearning/webnn#128

Bug: 1273291
Change-Id: I3e9ec7e7524705bdf436ef8bf5c07f6b072c2dae
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3977358
Commit-Queue: ningxin hu <[email protected]>
Reviewed-by: Jiewei Qian <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1082138}
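A hedged sketch of the visiting pattern this commit describes (the types and names below are hypothetical, not the actual Chromium code): walk the operators in topological order and define one backend Value per operand, split by operand kind:

```ts
// Hypothetical operand/operator model, mirroring the external/static/internal
// split described above; not the Chromium implementation.
type OperandKind = "input" | "output" | "constant" | "intermediate";

interface Operand {
  name: string;
  kind: OperandKind;
  dataType: "float32" | "float16"; // int8/int4 deferred to v2 per this thread
}

interface Operator {
  inputs: Operand[];
  outputs: Operand[];
}

function defineValues(operatorsInTopologicalOrder: Operator[]): Map<Operand, number> {
  const valueIds = new Map<Operand, number>();
  let nextId = 0;
  const define = (operand: Operand) => {
    if (valueIds.has(operand)) return; // each operand maps to exactly one Value
    switch (operand.kind) {
      case "input":
      case "output":
        // External Value: backed by caller-provided buffers at compute time.
        break;
      case "constant":
        // Static Value: weight data is baked into the backend graph.
        break;
      case "intermediate":
        // Internal Value: backend-owned scratch tensor between operators.
        break;
    }
    valueIds.set(operand, nextId++);
  };
  for (const op of operatorsInTopologicalOrder) {
    op.inputs.forEach(define);
    op.outputs.forEach(define);
  }
  return valueIds;
}
```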
It looks like this was added to the spec in 0970115 and we may have some implementation experience at this point. Close, despite it being marked v2?
The int8 quantized models may need some extra ops; for example, the Transformer Models Analysis spreadsheet has more details of the ops required by int8 quantized models (see the columns marked with (int8)).
Indeed, I have those four prototyped here (a minimal first set): https://github.com/fdwr/chromium-src-webnn-dml/pull/1/files#diff-e1b2517a6ae8f7c4494c75d17c8650b56e4f8d430f54f5e1f765475f00a5e1f3R427-R433
Seems int4 quantization is also a thing (with negligible impact on output quality). int4 practically halves the VRAM requirement of the model and offers a speedup on devices that support it. Example of an int4 quantized model: https://huggingface.co/01-ai/Yi-6B-Chat-4bits Should this be considered for v2? Or is int4 too specific? (I'm not sure whether 4-bit is adequate for image or audio models.) There's also a more aggressive {-1,0,1} quantization; it's fairly new, and I believe its application is limited to language models.
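For context, a hedged sketch (purely illustrative, not any backend's actual code) of why int4 halves memory relative to int8: two 4-bit values pack into each byte, so a runtime unpacks them before use:

```ts
// Unpack two's-complement int4 pairs (low nibble first) into int8 values.
function unpackInt4(packed: Uint8Array): Int8Array {
  const out = new Int8Array(packed.length * 2); // twice the element count, same byte count
  for (let i = 0; i < packed.length; i++) {
    const lo = packed[i] & 0x0f;
    const hi = packed[i] >> 4;
    // Sign-extend 4-bit two's complement: 0..7 positive, 8..15 map to -8..-1.
    out[2 * i] = lo < 8 ? lo : lo - 16;
    out[2 * i + 1] = hi < 8 ? hi : hi - 16;
  }
  return out;
}
```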
The BitNet paper was really cool. https://arxiv.org/abs/2310.11453
Hey @fdwr - how fresh is your prototype of these? And have you looked at how other backends (CoreML, TFLite) would implement these? Starting with the "minimum viable" quantization support as outlined in #623 is appealing!
It's moldy bread by now (but snippets could be reused). The ORT WebNN EP implementation still exists (it was originally added during prototyping) and would light up again once the op is added to Chromium.
There are differences between backends, but they should be expressible (🤞). For dequantization, most decompose to an elementwise subtract and multiply: output = (input - zeroPoint) * scale.
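A minimal sketch of that decomposition using existing WebNN-style ops, assuming a builder with cast/sub/mul (the stand-in types below exist only to make the sketch self-contained; the real WebNN IDL differs):

```ts
// Stand-ins for the WebNN IDL types, for illustration only.
type MLOperand = unknown;
interface MLGraphBuilder {
  cast(input: MLOperand, type: string): MLOperand;
  sub(a: MLOperand, b: MLOperand): MLOperand;
  mul(a: MLOperand, b: MLOperand): MLOperand;
}

// Emulate dequantization: output = (input - zeroPoint) * scale.
function dequantize(
  builder: MLGraphBuilder,
  input: MLOperand,     // int8 tensor
  scale: MLOperand,     // float32 scale (per-tensor or per-axis)
  zeroPoint: MLOperand  // int8 zero point
): MLOperand {
  const x = builder.cast(input, 'float32');
  const z = builder.cast(zeroPoint, 'float32');
  return builder.mul(builder.sub(x, z), scale);
}
```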
Discussed at the TPAC 2024 F2F. Group consensus was to implement QDQ operators for int8 and int4. Deduplicating with #93.
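A hedged sketch of that QDQ (quantize → op → dequantize) pattern; the operator names follow the #93 proposals, and the exact signatures here are assumptions:

```ts
// Stand-in types so the sketch typechecks; the real WebNN IDL differs.
type MLOperand = unknown;
interface MLGraphBuilder {
  quantizeLinear(input: MLOperand, scale: MLOperand, zeroPoint: MLOperand): MLOperand;
  dequantizeLinear(input: MLOperand, scale: MLOperand, zeroPoint: MLOperand): MLOperand;
  conv2d(input: MLOperand, filter: MLOperand): MLOperand;
}

// Activations/weights enter as quantized "Q" values and are dequantized back
// around the op; a backend may fuse dq -> conv2d -> q into one int8 kernel.
function qdqConv(
  builder: MLGraphBuilder,
  input: MLOperand, inputScale: MLOperand, inputZeroPoint: MLOperand,
  weight: MLOperand, weightScale: MLOperand, weightZeroPoint: MLOperand
): MLOperand {
  const qInput = builder.quantizeLinear(input, inputScale, inputZeroPoint);
  const qWeight = builder.quantizeLinear(weight, weightScale, weightZeroPoint);
  return builder.conv2d(
    builder.dequantizeLinear(qInput, inputScale, inputZeroPoint),
    builder.dequantizeLinear(qWeight, weightScale, weightZeroPoint));
}
```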
Closing this issue. Discussion of quantization operators should continue on #93. |