[Web] Quantized model decreases in size, but takes same amount of inference time as non-quantized model #21535
Labels
api:Javascript (issues related to the Javascript API)
platform:web (issues related to ONNX Runtime web; typically submitted using template)
quantization (issues related to quantization)
stale (issues that have not been addressed in a while; categorized by a bot)
Describe the issue
I have a transformer model whose modules (source embedding, positional encoding, encoder, decoder, projection layer, etc.) I am exporting to ONNX separately. For simplicity, I will focus on just one module, the encoder. The non-quantized encoder module is 75.7 MB and takes around 110 milliseconds per inference in ONNX Runtime Web (JavaScript). I used the following code to quantize the module:
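A minimal Python sketch of dynamic INT8 weight quantization with onnxruntime.quantization, which is consistent with the roughly 4x size reduction observed; the file names and weight type below are placeholders, not necessarily the exact call that was used:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization stores the weights as INT8 (hence the much smaller file),
# while activations stay in float and are quantized on the fly at runtime.
quantize_dynamic(
    model_input="encoder.onnx",         # exported FP32 encoder (placeholder path)
    model_output="encoder_quant.onnx",  # quantized output model (placeholder path)
    weight_type=QuantType.QInt8,        # INT8 weights
)
```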
The generated quantized model is 19.2 MB. However, web inference still takes roughly the same time, i.e. quantization has had no measurable impact on inference latency.
This is the inference code:
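In outline it looks like the following sketch; the input name `src`, the tensor shape, and the `performance.now()` timing are illustrative placeholders, and `ort`/`session` come from the session configuration shown next:

```javascript
// Run the encoder session and time a single inference call.
// "src" and the [1, seqLen, dModel] shape stand in for the model's real input.
const seqLen = 128;
const dModel = 512;
const inputData = new Float32Array(seqLen * dModel); // filled with real embeddings in practice

const feeds = {
  src: new ort.Tensor('float32', inputData, [1, seqLen, dModel]),
};

const start = performance.now();
const results = await session.run(feeds);
console.log(`encoder inference took ${(performance.now() - start).toFixed(1)} ms`, results);
```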
This is the session configuration:
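A typical setup for ONNX Runtime Web 1.18 with the WebAssembly CPU backend; the model path, thread count, and SIMD flag here are illustrative rather than the exact values used:

```javascript
import * as ort from 'onnxruntime-web';

// WebAssembly backend settings (ort.env.wasm flags in onnxruntime-web);
// the specific values are illustrative.
ort.env.wasm.numThreads = 4;  // multi-threading requires a cross-origin-isolated page
ort.env.wasm.simd = true;     // use SIMD-enabled WASM kernels when available

const session = await ort.InferenceSession.create('encoder_quant.onnx', {
  executionProviders: ['wasm'],   // WebAssembly CPU execution provider
  graphOptimizationLevel: 'all',
});
```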
Why is the quantized model smaller, yet takes the same time to run inference as the non-quantized model?
To reproduce
Unfortunately, the ONNX files are too large to upload here.
Urgency
No response
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
ONNX Runtime Web v1.18.0
Execution Provider
'wasm'/'cpu' (WebAssembly CPU)