How to use 4-bit AWQ? #1776
Follow-up questions... Here's a script I used to quantize using AWQ. Note the usage of MY CONVERSION SCRIPT
Regardless, my questions are:
Again, examples and a more thorough explanation of how to use AWQ in the documentation would be appreciated.
I might have answered my own question, but can you confirm? My understanding is that you can quantize (using a dataset for validation, no less) per these instructions: https://casper-hansen.github.io/AutoAWQ/examples/ ...and then convert with ct2-transformers-converter. Thanks.
Thanks for pointing that out! I'll update the documentation to clarify things soon. To clarify, int32_float16 is just the internal compute type used with AWQ models. You don't need to specify it when generating tokens; it will default to int32_float16 automatically. Follow these steps:
Step 1: You can either use an AWQ-quantized model from Hugging Face (as shown in the example) or quantize one yourself using this guide. Then convert the AWQ-quantized model to a CT2 model as described in the documentation.
Step 2: Run inference as you would with other models in CT2, just by specifying the model path.
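To make the two steps concrete, here is a minimal sketch. The model name, prompt, and decoding parameters are only example values; it assumes a decoder-only (generation) model converted with ct2-transformers-converter as in the command shown later in this thread.

```python
# Step 1 (shell): convert an AWQ-quantized Hugging Face model to CTranslate2 format, e.g.
#   ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ \
#       --copy_files tokenizer.model --output_dir ct2_model

# Step 2 (Python): run inference as with any other CT2 model; the int32_float16
# compute type is detected automatically, so nothing AWQ-specific is needed here.
import ctranslate2
import transformers

generator = ctranslate2.Generator("ct2_model", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

prompt = "What is AWQ quantization?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch([tokens], max_length=128, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
```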
Why don't you say it just like you did to me now, but in the documentation? ;-) And then give an example or two as well. I'm learning to convert to AWQ... When converting, it's possible to use a calibration dataset as well as specify a "version" for the type of conversion. The AutoAWQ docs mention "Marlin" but, if I understand correctly, ct2 4.4 only supports gemm and gemv? Will it simply not run correctly if I quantize using the Marlin kernel? To further complicate matters, when running the model (with AutoAWQ) I can specify a "version" such as "gemm", "gemv", or "exllama". The documentation says that you can only use "exllama" with a model that has been converted using "gemm". I'm confused by all of this. How does it relate to running on ctranslate2? Here's the relevant portion of my conversion script I'm referring to: SCRIPT HERE
Our version only supports gemm and gemv, so you'll need to choose between those two. For now, I believe these options are sufficient. When running with CT2, you don't need to specify gemm or gemv; it will be automatically detected based on the weight format.
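For reference, this is roughly what the AutoAWQ side of the workflow can look like: a sketch assuming the standard AutoAWQ API, where the model path and output directory are placeholders. The key point, per the answer above, is that "version" stays "GEMM" (or "GEMV") so the result can be converted to a CT2 model.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/hf_model"   # placeholder: any causal LM architecture CT2 supports
quant_path = "path/to/awq_model"  # placeholder: where the AWQ weights will be saved

# Keep "version" set to "GEMM" or "GEMV"; Marlin/ExLlama kernels are not
# supported by the CTranslate2 converter.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Once saved, the quantized directory can be passed to ct2-transformers-converter exactly like an AWQ model downloaded from the Hub.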
And can I run an AWQ model whose architecture CTranslate2 doesn't normally support? For example, some of the Zephyr models can't be converted to CTranslate2, but they can be quantized using AWQ. Is it now possible to use them on CTranslate2?
No, it works only with models supported by CTranslate2.
Anxiously awaiting the updated documentation to test further. Thanks.
Any update on this?
In reviewing the updated docs I noticed a few things that prompted some questions:
1. int32_float16 is mentioned in neither the "Quantize on model conversion" nor the "Quantize on model loading" sections here: https://opennmt.net/CTranslate2/quantization.html
2. The docs show the command ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ --copy_files tokenizer.model --output_dir ct2_model. This was confusing to me because by using "we" it implies that ctranslate2 itself can quantize a model to AWQ format. Is this the case or not?
3. Is it still true that even if a model is in AWQ format, it will still only be runnable if it originated from one of the model architectures that ctranslate2 supports? This is probably a kind of stupid question but I wanted to double-check.
4. Can we please get at least one example of how to actually run a model using 4-bit AWQ? I was not able to find a simple example, especially one using a transformers-based model.
Thanks yet again!