This document provides a quantization support matrix for the frameworks listed below, followed by the scale and zero-point formulas each framework's backend uses.
Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization
---|---|---|---
TensorFlow | oneDNN | Activation (int8/uint8), Weight (int8) | - |
PyTorch | FBGEMM | Activation (uint8), Weight (int8) | Activation (uint8) |
PyTorch IPEX | oneDNN | Activation (int8/uint8), Weight (int8) | - |
MXNet | oneDNN | Activation (int8/uint8), Weight (int8) | - |
ONNX Runtime | MLAS | Weight (int8) | Activation (uint8) |
- TensorFlow
  - Symmetric Quantization
    - int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
    - uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8))
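The same formulas reappear for PyTorch IPEX and MXNet below, which share the oneDNN backend. As a minimal sketch of how they play out, assuming a hypothetical helper name and calibration range (neither is a framework API):

```python
def onednn_symmetric_scale(rmin, rmax, dtype="int8"):
    """Illustrative helper (not a framework API): derive a scale from an
    observed float range [rmin, rmax] using the oneDNN-backend formulas."""
    if dtype == "int8":
        # int8: 2 * max(|rmin|, |rmax|) / (max(int8) - min(int8) - 1)
        #     = 2 * max(|rmin|, |rmax|) / (127 - (-128) - 1), i.e. divisor 254
        return 2 * max(abs(rmin), abs(rmax)) / 254
    # uint8: max(rmin, rmax) / (max(uint8) - min(uint8)), i.e. divisor 255
    return max(rmin, rmax) / 255

# Hypothetical calibration range, e.g. a ReLU6 activation observed in [0.0, 6.0]
print(onednn_symmetric_scale(0.0, 6.0, "int8"))   # 12 / 254 ~= 0.0472
print(onednn_symmetric_scale(0.0, 6.0, "uint8"))  # 6 / 255  ~= 0.0235
```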
- PyTorch
  - Symmetric Quantization
    - int8: scale = max(abs(rmin), abs(rmax)) / (float(max(int8) - min(int8)) / 2)
    - uint8: scale = max(abs(rmin), abs(rmax)) / (float(max(int8) - min(int8)) / 2)
  - Asymmetric Quantization
    - uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)
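A sketch of the FBGEMM formulas under the same caveats (illustrative names and ranges only). Note that the symmetric scale divides by half the int8 span, 255 / 2 = 127.5, for both int8 and uint8 targets:

```python
def fbgemm_symmetric_scale(rmin, rmax):
    # Both int8 and uint8: max(|rmin|, |rmax|) / (float(max(int8) - min(int8)) / 2)
    # = max(|rmin|, |rmax|) / 127.5
    return max(abs(rmin), abs(rmax)) / ((127 - (-128)) / 2.0)

def fbgemm_asymmetric_uint8(rmin, rmax):
    # scale = (rmax - rmin) / (max(uint8) - min(uint8))
    scale = (rmax - rmin) / 255
    # zero_point = min(uint8) - round(rmin / scale)
    zero_point = 0 - round(rmin / scale)
    return scale, zero_point

# Hypothetical activation range straddling zero
print(fbgemm_symmetric_scale(-1.0, 3.0))   # 3 / 127.5 ~= 0.0235
print(fbgemm_asymmetric_uint8(-1.0, 3.0))  # (~0.0157, 64)
```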
- PyTorch IPEX
  - Symmetric Quantization
    - int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
    - uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8))
- MXNet
  - Symmetric Quantization
    - int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
    - uint8: scale = max(rmin, rmax) / (max(uint8) - min(uint8))
- ONNX Runtime
  - Symmetric Quantization
    - int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
  - Asymmetric Quantization
    - uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)
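The matrix stops at the scale and zero-point definitions. To show how those parameters are consumed, here is a sketch of the conventional affine round trip, q = round(x / scale) + zero_point and x ≈ (q - zero_point) * scale, applied with the ONNX Runtime asymmetric uint8 formulas above. The round-trip convention is an assumption of this sketch, not something the matrix states:

```python
import numpy as np

def ort_asymmetric_uint8_params(rmin, rmax):
    # From the ONNX Runtime rows above:
    # scale = (rmax - rmin) / (max(uint8) - min(uint8))
    # zero_point = min(uint8) - round(rmin / scale)
    scale = (rmax - rmin) / 255
    zero_point = 0 - round(rmin / scale)
    return scale, zero_point

x = np.array([-1.0, 0.0, 0.5, 3.0], dtype=np.float32)
scale, zp = ort_asymmetric_uint8_params(float(x.min()), float(x.max()))

# Assumed affine round trip (see lead-in): quantize, clamp to uint8, dequantize
q = np.clip(np.round(x / scale) + zp, 0, 255).astype(np.uint8)
x_hat = (q.astype(np.float32) - zp) * scale

print(q)      # quantized uint8 codes, e.g. [  0  64  96 255]
print(x_hat)  # dequantized values; per-element error is at most ~scale / 2
```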