[Build] How can I quantize the llama3 model activations to int4? #21334
Labels: build, model:transformer, quantization, stale
Describe the issue
I’m trying to quantize a model to int4, but this file only provides weight-only quantization. Is there a way to quantize both the weights and the activations to int4?
https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py
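For context, here is roughly how I am using that file today — a minimal sketch assuming the `MatMul4BitsQuantizer` class from the linked module; the model path and `block_size` value are placeholders, and the exact constructor arguments may differ between onnxruntime versions:

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Hypothetical input path; any exported float ONNX model.
model = onnx.load("llama3-8b.onnx")

# Blockwise weight-only quantization: MatMul weights become int4,
# but activations stay in their original floating-point precision.
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("llama3-8b-int4.onnx", use_external_data_format=True)
```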
Thanks for your help!
Urgency
No response
Target platform
onnx
Build script
python -m onnxruntime.transformers.models.llama.convert_to_onnx -m /publicdata/huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/ --output llama3-8b-int4-gpu --precision int4 --execution_provider cuda --quantization_method blockwise --use_gqa
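For comparison, the closest path I found that quantizes activations as well is the static quantization API, but as far as I can tell it targets int8 activations rather than int4 — a minimal sketch, where the model paths and the calibration reader are placeholder assumptions (a real run should calibrate on representative prompts):

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class RandomCalibrationReader(CalibrationDataReader):
    """Hypothetical reader that feeds random token ids for calibration."""
    def __init__(self, num_samples=8):
        self.data = iter(
            {"input_ids": np.random.randint(0, 32000, (1, 128), dtype=np.int64)}
            for _ in range(num_samples)
        )

    def get_next(self):
        return next(self.data, None)

quantize_static(
    "llama3-8b.onnx",                 # hypothetical float model
    "llama3-8b-int8.onnx",            # quantized output
    RandomCalibrationReader(),
    activation_type=QuantType.QInt8,  # int8 activations; int4 not exposed here
    weight_type=QuantType.QInt8,
)
```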
Error / output
No error output; I expected to be able to quantize both the weights and the activations.
Visual Studio Version
No response
GCC / Compiler Version
No response