Quantization tool: support float 8 with MatMul, support float 16 weights #18043

Merged
merged 47 commits into from
Jan 12, 2024

@xadupre xadupre commented Oct 20, 2023

Description

Whenever a QuantizeLinear or DequantizeLinear node is created, the type of the weights before quantization must be known so that the scale is created with the expected type. Another option would be to add many CastLike operators, but that would push the burden onto the onnxruntime optimizer.

The PR tries to avoid changing function signatures. To do so, it modifies the scale computation to store the result in a numpy array rather than a Python float. The numpy array must have the same type as the weights to quantize.
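As an illustration of that idea, here is a minimal sketch of a scale computation that returns 0-d numpy arrays instead of Python floats. The helper name `compute_scale_zp`, its parameters, and the asymmetric uint8 scheme are assumptions made for this example; they are not taken from the PR's code.

```python
import numpy as np


def compute_scale_zp(weights, qmin=0, qmax=255, zp_dtype=np.uint8):
    """Hypothetical helper: asymmetric scale/zero point for `weights`.

    The scale is returned as a 0-d numpy array with the dtype of the
    weights, the zero point as a 0-d array with the quantized dtype.
    """
    dtype = weights.dtype
    # Extend the range so that 0.0 is always exactly representable.
    rmin = min(float(weights.min()), 0.0)
    rmax = max(float(weights.max()), 0.0)
    if rmax == rmin:
        # Degenerate case (all zeros): fall back to a neutral scale.
        return np.array(1.0, dtype=dtype), np.array(qmin, dtype=zp_dtype)
    # Storing the scale in the weights' dtype rounds it to that precision.
    scale = np.array((rmax - rmin) / (qmax - qmin), dtype=dtype)
    zero_point = round(qmin - rmin / float(scale))
    zero_point = np.array(np.clip(zero_point, qmin, qmax), dtype=zp_dtype)
    return scale, zero_point


weights = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float16)
scale, zero_point = compute_scale_zp(weights)
print(scale.dtype, zero_point.dtype)  # float16 uint8
```

Because the dtype travels with the array, code that later builds the scale initializer can read `scale.dtype` instead of receiving an extra argument.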

The PR adds many `assert` statements to check that the type of the scale is neither a Python type nor float64. They were added to make sure all the code follows the same logic, and they were kept for the first review.
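A check of that kind could look like the sketch below. The function `check_scale_type` and the accepted dtypes (float16 and float32 only) are assumptions for illustration; the actual assertions in the PR may differ in form and in the set of accepted types.

```python
import numpy as np


def check_scale_type(scale):
    """Hypothetical check: the scale must be a typed numpy array."""
    # np.float64 subclasses Python float, so this also rejects it.
    assert not isinstance(scale, (float, int)), (
        f"scale must be a numpy array, not {type(scale)}"
    )
    assert isinstance(scale, np.ndarray), f"unexpected type {type(scale)}"
    # Assumed set of accepted dtypes for this example.
    assert scale.dtype in (np.float16, np.float32), (
        f"unexpected scale dtype {scale.dtype}"
    )
    return scale


check_scale_type(np.array(0.5, dtype=np.float16))  # passes

try:
    check_scale_type(np.array(0.5))  # default dtype is float64
except AssertionError as exc:
    print("rejected:", exc)
```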

DequantizeLinear and QuantizeLinear cannot be tested with onnx==1.15: PR onnx/onnx#5709 is needed to fix shape inference, and PR onnx/onnx#5473 is needed to support QLinearMatMul with float 16. That explains why some tests are disabled with float 16.

Motivation and Context

The current quantization tool assumes every weight is float 32. For large models such as LLAMA, weights are usually float 16, and the quantization tool needs to be able to quantize them.

@xadupre xadupre marked this pull request as ready for review October 27, 2023 16:56

@yufenglee yufenglee left a comment

@xadupre xadupre merged commit c8399a8 into microsoft:main Jan 12, 2024
87 checks passed
adrianlizarraga added a commit that referenced this pull request Jan 13, 2024
…ta types (#19114)

### Description
- Updates `get_qnn_qdq_config()` to use new scale/zp np.array data
types.
- Adds missing unit test to help prevent future regression.



### Motivation and Context
#18043 changed the usage of
`extra_options["TensorQuantizationOverrides"]`. We need to update its
use in quantization/execution_providers/qnn/quant_config.py
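
To make the change concrete, the sketch below converts plain Python numbers in an overrides dictionary into typed numpy arrays. The helper `to_typed_overrides`, the dictionary layout (tensor name mapped to a list of entries) and the key names `scale` and `zero_point` are assumptions for illustration; the exact structure expected by `extra_options["TensorQuantizationOverrides"]` should be checked against the onnxruntime source.

```python
import numpy as np


def to_typed_overrides(overrides, scale_dtype=np.float32, zp_dtype=np.uint8):
    """Hypothetical helper: store scale/zero point as typed numpy arrays."""
    typed = {}
    for tensor_name, entries in overrides.items():
        typed_entries = []
        for entry in entries:
            entry = dict(entry)  # shallow copy, the input is left untouched
            if "scale" in entry:
                entry["scale"] = np.array(entry["scale"], dtype=scale_dtype)
            if "zero_point" in entry:
                entry["zero_point"] = np.array(entry["zero_point"], dtype=zp_dtype)
            typed_entries.append(entry)
        typed[tensor_name] = typed_entries
    return typed


# "input_0" is a made-up tensor name.
raw = {"input_0": [{"scale": 0.02, "zero_point": 128}]}
extra_options = {"TensorQuantizationOverrides": to_typed_overrides(raw)}
```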
mszhanyi pushed a commit that referenced this pull request Jan 15, 2024
…hts (#18043)

mszhanyi pushed a commit that referenced this pull request Jan 15, 2024
…ta types (#19114)

guotuofeng pushed a commit to microsoft/Olive that referenced this pull request Jan 17, 2024
## Describe your changes
PR microsoft/onnxruntime#18043 (onnxruntime)
extends onnxruntime quantization tools to support float16 weights. To do
so, it enforces scale and zero_point to be strongly typed (as
`numpy.array(single_value, dtype=dtype)`). The scale type should always be
the weight type, and the zero_point type the quantized weight type. That
convention is checked throughout the quantization tools to make sure
there is no loss of information. This change was made to avoid adding new
arguments to many functions to carry this information.
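
The convention can be illustrated with a small quantize/dequantize round trip. This is a sketch only: the functions `quantize` and `dequantize`, and the concrete scale and zero point values, are made up for the example and are not the onnxruntime implementation.

```python
import numpy as np


def quantize(weights, scale, zero_point, qmin=0, qmax=255):
    """Hypothetical linear quantization following the typing convention."""
    assert scale.dtype == weights.dtype, "scale type must be the weight type"
    # Compute in float32 to limit rounding error for float16 inputs.
    q = np.round(weights.astype(np.float32) / scale.astype(np.float32))
    q = q + np.float32(zero_point)
    # The quantized dtype is taken from the zero point.
    return np.clip(q, qmin, qmax).astype(zero_point.dtype)


def dequantize(q, scale, zero_point):
    """Inverse mapping; the result has the dtype of the scale."""
    x = (q.astype(np.float32) - np.float32(zero_point)) * scale.astype(np.float32)
    return x.astype(scale.dtype)


weights = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float16)
scale = np.array(3.0 / 255.0, dtype=np.float16)  # weight type
zero_point = np.array(85, dtype=np.uint8)        # quantized weight type

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q.dtype, restored.dtype)  # uint8 float16
```

Both dtypes are recovered from the two arrays alone, which is why no extra argument has to be threaded through the functions.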

xadupre added a commit to xadupre/onnxruntime that referenced this pull request Jan 17, 2024
xadupre added a commit that referenced this pull request Jan 17, 2024
…19182)

### Description
Extends the code coverage to the Entropy, Histogram and Distribution
calibration methods, and fixes bugs found while doing so.



### Motivation and Context
Bugs detected in [Olive](https://github.com/microsoft/OLive).
YUNQIUGUO pushed a commit that referenced this pull request Jan 23, 2024
…19182)

@xadupre xadupre deleted the qdqmm branch November 7, 2024 10:35
2 participants