
[Web] Quantized model decreases in size, but takes same amount of inference time as non-quantized model #21535

Open
kabyanil opened this issue Jul 28, 2024 · 4 comments
Labels
api:Javascript · platform:web · quantization · stale

Comments

@kabyanil

Describe the issue

I have a transformer model from which I'm exporting all the modules (i.e. source embedding, positional encoding, encoder, decoder, projection layer, etc.) separately to ONNX. For simplicity, I'll focus on just one module - the encoder. The non-quantized encoder module is 75.7 MB and takes around 110 milliseconds per inference in ONNX Runtime Web (JavaScript). I used the following code to quantize the module -

from onnxruntime.quantization import quantize_dynamic, QuantType

# encoder
quantize_dynamic(
    model_input=f'{common_dir}/encoder.onnx',
    model_output=f'{common_dir}/quantized/encoder.onnx',
    weight_type=QuantType.QUInt8,
)

The generated quantized model is 19.2 MB. However, web inference still takes roughly the same time, meaning the quantization has not had an impact on inference time.

This is the inference code -

 const src_encoder_out = await session.src_encode.run({
    input_1: src_pos_out,
    input_2: src_mask,
 }).then((res) => res[871])
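
For what it's worth, here is a minimal sketch of how the latency can be measured (same feeds as above; the first call is treated as a warm-up so one-time initialization cost is excluded):

// warm-up: the first run includes one-time initialization overhead
await session.src_encode.run({ input_1: src_pos_out, input_2: src_mask });

// timed runs
const runs = 20;
const t0 = performance.now();
for (let i = 0; i < runs; i++) {
   await session.src_encode.run({ input_1: src_pos_out, input_2: src_mask });
}
console.log(`average latency: ${((performance.now() - t0) / runs).toFixed(1)} ms`);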

This is the session configuration -

const sessionOptions = {
   executionProviders: ['wasm'],
   enableCpuMemArena: true,
   // enableGraphCapture: true,
   executionMode: "parallel",
   enableMemPattern: true,
   intraOpNumThreads: 4,
   graphOptimizationLevel: "extended"
}

// create the session variable
const session = {
   ...
   src_encode: await ort.InferenceSession.create("./models/encoder.onnx", sessionOptions),
   ...
}

Why is the quantized model smaller, but no faster at inference than the non-quantized model?

To reproduce

Unfortunately, the onnx files are too big to upload here.

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

ONNX Runtime Web v1.18.0

Execution Provider

'wasm'/'cpu' (WebAssembly CPU)

kabyanil added the platform:web label on Jul 28, 2024
github-actions bot added the api:Javascript and quantization labels on Jul 28, 2024

gyagp commented Jul 30, 2024

Weight quantization may save size and I/O, but it may not noticeably impact inference time, as the underlying compute is still FP32. If you need more performance, could you try the WebGPU EP? If it doesn't work as well as expected, please share the model and the web app that runs it.
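
For reference, a minimal sketch of creating a session with the WebGPU EP (assuming onnxruntime-web 1.18, which ships a separate webgpu entry point; 'wasm' is listed after it as a fallback):

// use the WebGPU-enabled build of onnxruntime-web
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('./models/encoder.onnx', {
   // EPs are tried in order, so this falls back to WASM when WebGPU is unavailable
   executionProviders: ['webgpu', 'wasm'],
});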


kabyanil commented Aug 3, 2024

My target environment may not have a GPU available, so I can't rely on WebGPU. What is your opinion on ONNX Runtime Web vs. TFJS in terms of CPU performance?


gyagp commented Aug 13, 2024

I think you mean WASM (TFJS also has a CPU backend written in TypeScript, in addition to the WASM backend written in C++), but I don't have concrete data on how the two compare. BTW, SIMD and multi-threading usually bring a large performance gain for WASM.
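
For reference, a minimal sketch of turning both on in onnxruntime-web (flag names as in the 1.18 API; multi-threading only takes effect on cross-origin isolated pages, since it relies on SharedArrayBuffer):

// must be set before the first InferenceSession.create() call
ort.env.wasm.simd = true;       // SIMD-enabled WASM binary (the default in recent releases)
ort.env.wasm.numThreads = 4;    // falls back to 1 thread if the page is not cross-origin isolated

const session = await ort.InferenceSession.create('./models/encoder.onnx', {
   executionProviders: ['wasm'],
});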

github-actions bot commented

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions bot added the stale label on Sep 12, 2024