WebNN should support int8 quantized models #128

Closed
wchao1115 opened this issue Dec 14, 2020 · 14 comments

@wchao1115
Collaborator

Supporting int8 quantized models is essential for mobile scenarios and for many NPU architectures. TensorFlow (Lite) and ONNX, for instance, have int8 quantization support built in, and WebNN should too. Related: #93

@anssiko
Member

anssiko commented Feb 11, 2022

@wchao1115 @huningxin do you think we should label this as "cr" for #240 purposes?

@huningxin
Contributor

I think this is an important one and I support labeling it as "cr".

@anssiko
Member

anssiko commented Feb 24, 2022

@wchao1115 this issue was on the agenda today, but we had to defer due to timing. Let us know your thoughts. I'm planning to bring this up for discussion at our next meeting.

@anssiko anssiko added the cr label Mar 24, 2022
@anssiko
Member

anssiko commented Mar 24, 2022

Per discussion at https://www.w3.org/2022/03/24-webmachinelearning-minutes.html#t06 we consider this to be in scope for CR.

@anssiko
Member

anssiko commented Sep 28, 2022

We've discussed this feature on our recent meetings:
https://www.w3.org/2022/09/22-webmachinelearning-minutes.html#t05
https://www.w3.org/2022/09/08-webmachinelearning-minutes.html#t05
https://www.w3.org/2022/08/25-webmachinelearning-minutes.html#t06

I will label this issue as "v2" due to the implementation experience required for the initial CR inclusion. There's a mechanism for us to publish a Candidate Recommendation Draft subsequent to the initial CR that would give us adequate time to properly define, develop and test this feature.

Furthermore, we should soon start discussing the WebNN "v2" plan as we look to extend our current charter, and this feature could be one concrete feature to highlight. We can continue discussing this feature on our bi-weekly calls when there's new information and revise our position as appropriate.

@anssiko anssiko added v2 and removed cr labels Sep 28, 2022
aarongable pushed a commit to chromium/chromium that referenced this issue Dec 12, 2022
An XNNPACK Subgraph uses Values to represent the tensor data produced
and consumed by Nodes. This CL implements the methods that help define
XNNPACK Values for MLOperands. That includes the external Values for
the graph's input and output operands, the static Values for constant
operands and the internal Values for intermediate operands. These methods
are used by the MLGraphXnnpack::CreateXnnSubgraphAndRuntime() method that
visits the graph's operators in topological order and defines XNNPACK Values
for the input and output operands of each operator.

This CL initially supports defining XNNPACK Values for the float32 and
float16 MLOperandTypes. Support for quantized integer types will be
implemented as a WebNN V2 feature [1].

This CL also implements the DefineXnnpackValuesTest that covers the
definitions of different types of XNNPACK Values in various WebNN graph
topologies.

[1]: webmachinelearning/webnn#128

Bug: 1273291
Change-Id: I3e9ec7e7524705bdf436ef8bf5c07f6b072c2dae
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3977358
Commit-Queue: ningxin hu <[email protected]>
Reviewed-by: Jiewei Qian <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1082138}
@inexorabletash
Member

It looks like this was added to the spec in 0970115 and we may have some implementation experience at this point. Close, despite it being marked v2?

@huningxin
Contributor

Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, that are missing from the current spec.

The Transformer Models Analysis spreadsheet has more details on the ops required by int8 quantized models (see the columns marked with (int8)).

@fdwr @Honry
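
As a rough illustration of the gap, here is a minimal sketch (illustration only, not spec text or the prototype mentioned below) of how a per-tensor DequantizeLinear could be emulated with ops already in the spec (cast, sub, mul). The descriptor shape, field names and values are assumptions for the example:

```js
// Sketch: dequantize a uint8 tensor with per-tensor scale and zero point
// using only existing MLGraphBuilder ops. Illustrative values throughout.
const builder = new MLGraphBuilder(context);
const quantized = builder.input('x', {dataType: 'uint8', dimensions: [1, 1024]});
const zeroPoint = builder.constant({dataType: 'float32', dimensions: [1]},
                                   new Float32Array([128]));
const scale = builder.constant({dataType: 'float32', dimensions: [1]},
                               new Float32Array([0.05]));
// real = (quantized - zeroPoint) * scale
const real = builder.mul(
    builder.sub(builder.cast(quantized, 'float32'), zeroPoint),
    scale);
```

A dedicated DequantizeLinear op would let backends keep the data in integer form instead of forcing this float decomposition.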

@fdwr
Collaborator

fdwr commented Feb 22, 2024

Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, that are missing from the current spec.

Indeed, I have those 4 prototyped here (a minimal first four): https://github.com/fdwr/chromium-src-webnn-dml/pull/1/files#diff-e1b2517a6ae8f7c4494c75d17c8650b56e4f8d430f54f5e1f765475f00a5e1f3R427-R433

@wacky6

wacky6 commented Mar 14, 2024

It seems int4 quantization is also a thing (with negligible impact on output quality). int4 practically halves the VRAM requirement of the model, and offers a speedup on devices that support it.

Example of a int4 quantization model: https://huggingface.co/01-ai/Yi-6B-Chat-4bits

Should this be considered for v2? Or is int4 too specific? (I'm not sure whether 4-bit is adequate for image or audio models.)

// There's a more aggressive {-1, 0, 1} quantization. It's fairly new, and I believe its application is limited to language models.
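
For rough context (back-of-envelope, weights only, ignoring activations and the KV cache): a ~6B-parameter model like the one linked above needs roughly 12 GB at float16, ~6 GB at int8, and ~3 GB at int4.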

@inexorabletash
Member

The BitNet paper was really cool. https://arxiv.org/abs/2310.11453

@inexorabletash
Member

Int8 quantized models may need some extra ops, for example DynamicQuantizeLinear, DequantizeLinear, ConvInteger and MatMulInteger, that are missing from the current spec.

Indeed, I have those 4 prototyped here (a minimal first four): https://github.com/fdwr/chromium-src-webnn-dml/pull/1/files#diff-e1b2517a6ae8f7c4494c75d17c8650b56e4f8d430f54f5e1f765475f00a5e1f3R427-R433

Hey @fdwr - how fresh is your prototype of these? And have you looked at how other backends (CoreML, TFLite) would implement these? Starting with the "minimum viable" quantization support as outlined in #623 is appealing!

@fdwr
Collaborator

fdwr commented Aug 2, 2024

@inexorabletash

how fresh is your prototype

It's moldy bread by now (but snippets could be reused). The ORT WebNN EP implementation still exists (it was originally added during prototyping) and would light up again once the op is added into Chromium.

And have you looked at how other backends (CoreML, TFLite)

There are differences, but they should be expressible (🤞). For dequantization, most decompose to `output = mul(sub(input, zeroPoint), scale)` (except full TF, CoreML MIL's LUT mode, and CoreML's scale & bias form). They have differing broadcasting rules, which I'd like to iron out to be more consistent (consistent with the unidirectional broadcasting of its decomposition and of expand).

| API | Name | Equation | Types |
| --- | --- | --- | --- |
| TFLite | DequantizeOp | real = (quantized - zeroPoint) * scale (link) | input: uint4, uint8, int8, int16, float16<br>zeroPoint: uint8<br>scale: float32<br>output: float32 |
| TF | tf.quantization.dequantize | output = minRange + (input * (maxRange - minRange) / dataTypeRange) | input: uint8<br>minRange: float32<br>maxRange: float32<br>dataTypeRange: int<br>output: float32 |
| CoreML MIL | constexpr_affine_dequantize | real = (input - zeroPoint) * scale | input: uint8, int8<br>zeroPoint: uint8, int8, float32<br>scale: same as output<br>output: float16, float32 |
| CoreML MIL | constexpr_lut_to_dense | real = lut[input] | input: uint1, uint2, uint4, uint6, uint8<br>output: uint8, int8, float16, float32 |
| CoreML | LinearQuantizationParams | input * scale + bias ? | input: ?<br>scale: float32<br>bias: float32<br>output: ? |
| ONNX | DequantizeLinear | real = (input - zeroPoint) * scale | input: uint4, int4, uint8, int8, uint16, int16, int32, float8e4m3fn, float8e4m3fnuz, float8e5m2, float8e5m2fnuz<br>zeroPoint: same as input<br>scale: same as output<br>output: bfloat16, float16, float32 |
| DML | DEQUANTIZE_LINEAR | real = (input - zeroPoint) * scale | input: uint4, int4, uint8, int8, uint16, int16, uint32, int32<br>zeroPoint: same as input<br>scale: same as output<br>output: float16, float32 |

@reillyeon
Contributor

Discussed at the TPAC 2024 F2F. Group consensus was to implement QDQ operators for int8 and int4. Deduplicating with #93.

@reillyeon
Contributor

Closing this issue. Discussion of quantization operators should continue on #93.
