
Quantized SeaLLM v2 Model Outputs Same as Input #21636

Open

sabre-code opened this issue Aug 6, 2024 · 1 comment

Labels: model:transformer · performance · quantization

Comments

@sabre-code

Describe the issue

We encountered an issue while using SeaLLM v2, a 7B model, in ONNX format with int8 quantization for translation purposes. Here are the steps we followed and the problem we're facing:

  1. Model Conversion to ONNX:

    • We used the Optimum CLI to convert SeaLLM v2 into ONNX format.
    • The conversion produced a full-precision (fp32) ONNX model.
  2. Model Quantization (a sketch of steps 1 and 2 follows this list):

    • We applied the quantize_dynamic() function to convert the fp32 model to int8.
    • The quantization process completed without errors.
  3. Issue:

    • When we use the quantized model for translation, the output is identical to the input.
    • The issue is not isolated to SeaLLM v2; we have seen the same behavior with other quantized models, such as TinyLlama.
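
A minimal sketch of the conversion and quantization steps (the model ID, file names, and output directory below are illustrative placeholders, not necessarily the exact ones we used):

```python
# Step 1 (shell): export the fp32 ONNX model with the Optimum CLI, e.g.
#   optimum-cli export onnx --model SeaLLMs/SeaLLM-7B-v2 seallm_onnx/
from onnxruntime.quantization import quantize_dynamic, QuantType

# Step 2: dynamic int8 quantization of the exported fp32 graph.
# Weights are quantized offline; activations are quantized at runtime.
quantize_dynamic(
    model_input="seallm_onnx/model.onnx",
    model_output="seallm_onnx/model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```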

To reproduce

Steps to Reproduce:

  1. Convert SeaLLM v2 to ONNX using the Optimum CLI.
  2. Quantize the ONNX model from fp32 to int8 using quantize_dynamic().
  3. Use the quantized model for a translation task (a minimal check is sketched below).
  4. Observe that the output is the same as the input.
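
A minimal sketch of steps 3 and 4, loading the quantized graph through Optimum's ORTModelForCausalLM (the directory, file name, and prompt are placeholders):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load the tokenizer from the export directory and point the ORT session
# at the int8 graph produced by quantize_dynamic().
tokenizer = AutoTokenizer.from_pretrained("seallm_onnx")
model = ORTModelForCausalLM.from_pretrained("seallm_onnx", file_name="model_int8.onnx")

prompt = "Translate to English: Xin chào"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)

# Step 4: with the int8 model, the decoded text contains nothing beyond the prompt.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```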

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04.6

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

@sabre-code added the performance label on Aug 6, 2024
@github-actions bot added the quantization label on Aug 6, 2024
@sophies927 added the model:transformer label on Aug 8, 2024
@yufenglee (Member)

@sabre-code, could you please try running this model with onnxruntime-genai?
Here is an example showing how to create and run a similar model:
https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/README.md#get-the-model
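
A rough sketch of that flow, based on the README above (the model ID, output path, and precision flag are placeholders, and the generation loop follows the 0.x Python examples, so the exact calls may differ in newer releases):

```python
# Build an ORT GenAI model directly from the Hugging Face checkpoint (shell), e.g.:
#   python -m onnxruntime_genai.models.builder \
#       -m SeaLLMs/SeaLLM-7B-v2 -o seallm_genai -p int4 -e cpu
import onnxruntime_genai as og

model = og.Model("seallm_genai")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("Translate to English: Xin chào")

# Greedy decode, printing tokens as they are generated.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```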
