In this section, we explore the advanced quantization capabilities of the Quark quantizer, designed to recover accuracy lost during quantization. While basic quantization configurations work well for many models, heavily optimized architectures often require more sophisticated techniques to restore the accuracy of their quantized versions. This guide walks you through these advanced methods so your models maintain high performance even after quantization.
This tutorial takes the MobileNetV2 ONNX model as an example: it is challenging to quantize with minimal accuracy loss, making it a good showcase for the advanced Quark quantization tools.

Please ensure the validation and calibration datasets are set up using the instructions from the Quark Quantization Tutorial.
**MobileNet:** Using the `XINT8` configuration, we see a drop of ~3% in Top-1 accuracy. Optimized models like MobileNetV2 tend to be more difficult to quantize. To bridge the gap between float and quantized accuracy, we can use some advanced quantization configurations and techniques.
```shell
python advanced_quark_quantize.py --model_input models/mobilenetv2.onnx --model_output models/mobilenetv2_quant.onnx
```
| MobileNetV2 | Model Size | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| Float 32 | 13.34 MB | 71.3% | 90.6% |
| INT8 (CPU) | 3.44 MB | 64.0% | 86.5% |
| INT8 (NPU) | 3.44 MB | 63.7% | 87.0% |
**ResNet50:** Using the `XINT8` configuration:
```shell
python advanced_quark_quantize.py --model_input models/resnet50.onnx --model_output models/resnet50_quant.onnx
```
| ResNet50 | Model Size | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| Float 32 | 97.41 MB | 80.0% | 96.1% |
| INT8 (CPU) | 24.46 MB | 77.3% | 94.9% |
| INT8 (NPU) | 24.46 MB | 77.4% | 95.2% |
Fast fine-tuning adjusts the weights of a pre-trained model after quantization to recover the accuracy lost in the process, making the quantized model more suitable for deployment.
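To build intuition for what fast fine-tuning optimizes, here is a toy sketch of the AdaRound-style idea it is based on: instead of always rounding each weight to the nearest quantization grid point, the rounding direction (down or up) is chosen per weight to minimize the layer's output error on calibration data. This is an illustrative toy with made-up helper names, not Quark's implementation (real AdaRound optimizes a continuous relaxation rather than searching exhaustively):

```python
import itertools
import math

def quantize(w, scale, rounding):
    """Quantize a single weight with an explicit rounding function."""
    return rounding(w / scale) * scale

def layer_output(weights, x):
    """Dot product standing in for a layer's output."""
    return sum(w * xi for w, xi in zip(weights, x))

def adaround_weights(weights, scale, calib_x):
    """Pick floor/ceil per weight to minimize output error on calibration data.

    Exhaustive search is feasible only for tiny layers; it is used here
    purely to illustrate the objective.
    """
    float_out = layer_output(weights, calib_x)
    best_q, best_err = None, float("inf")
    for choices in itertools.product((math.floor, math.ceil), repeat=len(weights)):
        q = [quantize(w, scale, r) for w, r in zip(weights, choices)]
        err = abs(layer_output(q, calib_x) - float_out)
        if err < best_err:
            best_q, best_err = q, err
    return best_q, best_err

weights = [0.34, -0.71, 0.52]
scale = 0.25
calib_x = [1.0, 2.0, 3.0]

q_ada, err_ada = adaround_weights(weights, scale, calib_x)
q_nearest = [quantize(w, scale, round) for w in weights]
err_nearest = abs(layer_output(q_nearest, calib_x) - layer_output(weights, calib_x))
print(err_ada, err_nearest)  # output-aware rounding is at least as good here
```

Quark's fast fine-tuning applies this kind of output-aware weight adjustment layer by layer, driven by the calibration data, which is why it can claw back most of the accuracy that plain nearest-rounding quantization loses.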
```python
# QuantizationConfig, Config, QuantType, CalibrationMethod and
# DEFAULT_ADAROUND_PARAMS come from the quark.onnx package; see the
# Quark Quantization Tutorial for the dataset and import setup.
INT8_CNN_ACCURATE_CONFIG = QuantizationConfig(
    calibrate_method=CalibrationMethod.Percentile,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    include_fast_ft=True,
    extra_options={
        'Percentile': 99.9999,
        'FastFinetune': DEFAULT_ADAROUND_PARAMS,
    })
config = Config(global_quant_config=INT8_CNN_ACCURATE_CONFIG)
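The `'Percentile': 99.9999` option controls how the activation range is calibrated: rather than clipping at the absolute observed maximum, the range is cut at a high percentile so a few extreme outliers do not stretch the INT8 scale. A minimal sketch of the idea (assumed semantics, not Quark's internal code; a lower percentile is used here because the sample is small):

```python
def percentile(values, pct):
    """Percentile with linear interpolation over a sorted copy."""
    xs = sorted(values)
    k = (len(xs) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# 10,000 well-behaved activations in [0, 999.9] plus one extreme outlier.
acts = [x / 10.0 for x in range(10000)] + [5000.0]

scale_minmax = max(acts) / 255.0            # min-max calibration (UINT8)
scale_pct = percentile(acts, 99.9) / 255.0  # percentile calibration

# The percentile scale is far smaller, giving finer resolution for the
# values the model actually produces; the rare outlier simply saturates.
print(scale_minmax, scale_pct)
```

With min-max calibration the single outlier inflates the quantization step roughly fivefold, wasting most of the 256 levels; percentile calibration keeps the step matched to the bulk of the distribution.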
**MobileNet:** Using the `INT8_CNN_ACCURATE` configuration, which improves model accuracy through fast fine-tuning and histogram-percentile-based calibration:
```shell
python advanced_quark_quantize.py --model_input models/mobilenetv2.onnx --model_output models/mobilenetv2_quant.onnx --fast_finetune
```
| MobileNetV2 | Model Size | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| Float 32 | 13.32 MB | 71.3% | 90.6% |
| INT8 (CPU) | 3.43 MB | 70.5% | 90.3% |
| INT8 (NPU) | 3.43 MB | 69.6% | 89.4% |
**ResNet50:** Using the `Fast Fine-tuning` configuration:
```shell
python advanced_quark_quantize.py --model_input models/resnet50.onnx --model_output models/resnet50_quant.onnx --fast_finetune
```
| ResNet50 | Model Size | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| Float 32 | 97.41 MB | 80.0% | 96.1% |
| INT8 (CPU) | 24.46 MB | 79.3% | 96.2% |
| INT8 (NPU) | 24.46 MB | 77.4% | 95.2% |
Cross-Layer Equalization (CLE) optimizes neural networks for quantization by balancing weight distributions across layers, reducing quantization errors. This technique helps maintain model accuracy while enabling efficient quantization.
```python
INT8_CLE_CONFIG = QuantizationConfig(
    calibrate_method=PowerOfTwoMethod.MinMSE,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    enable_npu_cnn=True,
    include_cle=True,
    extra_options={'ActivationSymmetric': True})
config = Config(global_quant_config=INT8_CLE_CONFIG)
```
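The mechanics of CLE can be sketched in a few lines. For two consecutive layers separated by a ReLU, each hidden channel can be rescaled by a factor `s` (dividing the producing weights, multiplying the consuming ones) without changing the network's output, because `relu(h / s) = relu(h) / s` for `s > 0`. Choosing `s = sqrt(r1 / r2)` equalizes the per-channel weight ranges of the two layers. This is a hedged toy sketch of the technique, not Quark's implementation:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W1, W2, x):
    """Two linear layers with a ReLU in between."""
    return matvec(W2, relu(matvec(W1, x)))

def equalize(W1, W2):
    """Per-channel CLE: scale hidden channel i by s_i = sqrt(r1_i / r2_i)."""
    W1e = [row[:] for row in W1]
    W2e = [row[:] for row in W2]
    for i in range(len(W1)):
        r1 = max(abs(w) for w in W1[i])      # out-channel range, layer 1
        r2 = max(abs(row[i]) for row in W2)  # in-channel range, layer 2
        s = math.sqrt(r1 / r2)
        W1e[i] = [w / s for w in W1[i]]
        for row in W2e:
            row[i] *= s
    return W1e, W2e

W1 = [[8.0, -6.0], [0.05, 0.02]]  # very unbalanced channel ranges
W2 = [[0.01, 4.0]]
W1e, W2e = equalize(W1, W2)

x = [1.0, 0.5]
print(forward(W1, W2, x), forward(W1e, W2e, x))  # same output
```

After equalization both layers carry weights of comparable magnitude per channel, so a single per-tensor (or power-of-two) scale wastes far fewer quantization levels, which is why CLE reduces quantization error without any retraining.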
**MobileNet:** Using the Cross-Layer Equalization configuration:
```shell
python advanced_quark_quantize.py --model_input models/mobilenetv2.onnx --model_output models/mobilenetv2_quant.onnx --cross_layer_equalization
```
| MobileNetV2 | Model Size | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| Float 32 | 13.32 MB | 71.3% | 90.6% |
| INT8 (CPU) | 3.43 MB | 62.7% | 85.4% |
| INT8 (NPU) | 3.43 MB | 63.7% | 86.4% |
**ResNet50:** Using the Cross-Layer Equalization configuration:
```shell
python advanced_quark_quantize.py --model_input models/resnet50.onnx --model_output models/resnet50_quant.onnx --cross_layer_equalization
```
| ResNet50 | Model Size | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| Float 32 | 97.41 MB | 80.0% | 96.1% |
| INT8 (CPU) | 24.46 MB | 77.7% | 95.7% |
| INT8 (NPU) | 24.46 MB | 78.2% | 95.6% |
For more details on the Quark API features, refer to the Quark Documentation.