Quantization failed! onnxruntime.quantization.quantize_dynamic does not seem to convert to a qint8 .onnx file successfully #21440
Labels
quantization, stale
Describe the issue
Before converting
I am trying to convert Qwen2 1.5B to an int8 ONNX file using the code below. Originally this fp16 model is about 3 GB; I successfully converted it to an FP32 ONNX file (the ONNX export default) and tested it, but something goes wrong with the int8 conversion.
model.onnx is an .onnx file with its weight files in the same directory. The total size is about 6 GB (4 bytes × 1.5B params), so the expected size of the int8 ONNX file should be around 1.5 GB.
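(The exact snippet is not preserved in this issue text; the call was essentially plain dynamic quantization, roughly like the sketch below. Paths simply match the file names mentioned above.)

```python
# Rough sketch of the dynamic-quantization call (not the exact original snippet;
# paths are placeholders matching the files mentioned above).
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",                  # FP32 export, external weight files alongside it
    model_output="output_int/model_int8.onnx",
    weight_type=QuantType.QInt8,               # expecting ~1.5 GB of int8 weights for 1.5B params
)
```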
After converting, the output is:
root@I1b52bb69840070157e:~/output_int# ls -la
total 6293056
drwxr-xr-x 2 root root 65 Jul 22 17:05 .
drwx------ 1 root root 4096 Jul 22 17:04 ..
-rw-r--r-- 1 root root 11008 Jul 22 17:05 logits_int8.onnx
-rw-r--r-- 1 root root 6444070485 Jul 22 17:05 model_int8.onnx
Obviously the size is still the same as the FP32 model; I suspect the cause may be related to the file size / parameter count.
I then passed use_external_data_format=True to make it save the weight files separately, but that didn't work either.
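Roughly, the variant with that flag looked like the sketch below (again, paths are placeholders), together with a simple size check on the output directory:

```python
# Same call, but asking the quantizer to store weights as external data
# (sketch only; paths are placeholders).
import os
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",
    model_output="output_int/model_int8.onnx",
    weight_type=QuantType.QInt8,
    use_external_data_format=True,   # weights should go into a separate external-data file
)

# Sanity check: an int8 copy of a 1.5B-parameter model should total roughly 1.5 GB.
total = sum(os.path.getsize(os.path.join("output_int", f)) for f in os.listdir("output_int"))
print(f"total output size: {total / 1e9:.2f} GB")
```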
To reproduce
Converting a decoder-only large model such as Qwen2 or Llama 3 may reproduce this. As I tested, the openai/whisper model (when exporting the encoder and decoder separately) does not have this issue.
Urgency
My project has a deadline of 7/24; does anyone know how to solve this?
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
latest version (1.18.1)
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 12.1, onnxruntime (latest), opset 17