How to use 4-bit AWQ? #1776
Follow-up questions... Here's a script I used to quantize using AWQ. Note the usage of MY CONVERSION SCRIPT
Regardless, my questions are:
Again, examples and a more thorough explanation of how to use AWQ in the documentation would be appreciated.
I might have answered my own question, but can you confirm? My understanding is that you can quantize (using a dataset for validation, no less) per these instructions: https://casper-hansen.github.io/AutoAWQ/examples/ ...and then convert with ct2-transformers-converter. Thanks.
Thanks for pointing that out! I'll update the documentation to clarify things soon. To clarify, int32_float16 is just the internal compute type used with AWQ models. You don't need to specify it when generating tokens; it will default to int32_float16 automatically. Follow these steps:
Step 1: You can either use an AWQ-quantized model from Hugging Face (as shown in the example) or quantize one yourself using this guide. Then convert the AWQ-quantized model to a CT2 model as described in the documentation.
Step 2: Run inference as you would with other models in CT2, just by specifying the model path.
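To make the two steps concrete, here is a minimal sketch. The model name, prompt, and decoding parameters are only example values; it assumes a decoder-only (generation) model converted with ct2-transformers-converter as in the command shown later in this thread.

```python
# Step 1 (shell): convert an AWQ-quantized Hugging Face model to CTranslate2 format, e.g.
#   ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ \
#       --copy_files tokenizer.model --output_dir ct2_model

# Step 2 (Python): run inference as with any other CT2 model; the int32_float16
# compute type is detected automatically, so nothing AWQ-specific is needed here.
import ctranslate2
import transformers

generator = ctranslate2.Generator("ct2_model", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

prompt = "What is AWQ quantization?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([tokens], max_length=128, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
```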
Why don't you say it just like you did to me now, but in the documentation? ;-) And then give an example or two as well. I'm learning to convert to AWQ... When converting, it's possible to use a calibration dataset as well as specify a "version" for the type of conversion. The AutoAWQ docs mention "Marlin" but, if I understand correctly, ct2 4.4 only supports gemm and gemv? Will it simply not run correctly if I quantize using the Marlin kernel? To further complicate matters, when running the model (with AutoAWQ) I can specify a "version" such as "gemm", "gemv", or "exllama". The documentation says that you can only use "exllama" with a model that has been converted using "gemm". I'm confused by all of this. How does it relate to running on ctranslate2? Here's the relevant portion of my conversion script I'm referring to: SCRIPT HERE
Our version only supports gemm and gemv, so you'll need to choose between those two. For now, I believe these options are sufficient. When running with CT2, you don't need to specify gemm or gemv; it will be automatically detected based on the weight format.
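For reference, this is roughly what the AutoAWQ side of the workflow can look like: a sketch assuming the standard AutoAWQ API, where the model path and output directory are placeholders. The key point, per the answer above, is that "version" stays "GEMM" (or "GEMV") so the result can be converted to a CT2 model.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/hf_model"   # placeholder: any causal LM architecture CT2 supports
quant_path = "path/to/awq_model"  # placeholder: where the AWQ weights will be saved

# Keep "version" set to "GEMM" or "GEMV"; Marlin/ExLlama kernels are not
# supported by the CTranslate2 converter.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Once saved, the quantized directory can be passed to ct2-transformers-converter exactly like an AWQ model downloaded from the Hub.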
And can I run an AWQ model whose architecture CTranslate2 doesn't normally support? For example, some of the Zephyr models can't be converted to CTranslate2, but they can be quantized using AWQ. Is it now possible to use them on CTranslate2?
No, it works only with models supported by CTranslate2.
Anxiously awaiting the updated documentation to test further. Thanks.
Any update on this?
In reviewing the updated docs I noticed a few things that prompted some questions:
1. int32_float16 is mentioned in neither the "Quantize on model conversion" nor the "Quantize on model loading" sections here: https://opennmt.net/CTranslate2/quantization.html
2. The docs show the command ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ --copy_files tokenizer.model --output_dir ct2_model. This was confusing to me because by using "we" it implies that ctranslate2 itself can quantize a model to AWQ format. Is this the case or not?
3. Is it still true that even if a model is in AWQ format, it will still only be runnable if it originated from one of the model architectures that ctranslate2 supports? This is probably a kind of stupid question but I wanted to double-check.
4. Can we please get at least one example of how to actually run a model using 4-bit AWQ? I was not able to find a simple example, especially one using a transformers-based model.
Thanks yet again!