# LLM Quantization and Benchmarking
For GPU inference, the base model is quantized to GPTQ format using AutoGPTQ at 4-bit precision.
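A minimal sketch of that quantization step, assuming the `bigcode/starcoderbase-3b` checkpoint and AutoGPTQ's standard API; the calibration sample, output directory, and `group_size` below are illustrative, not taken from the actual run:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "bigcode/starcoderbase-3b"
out_dir = "starcoderbase-3b-gptq"  # illustrative output path

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ settings; group_size=128 is a common default.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# GPTQ needs calibration data; a real run would use many code snippets.
examples = [tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)       # run the GPTQ algorithm over the calibration set
model.save_quantized(out_dir)  # write the 4-bit checkpoint
tokenizer.save_pretrained(out_dir)
```

The saved checkpoint can then be loaded for GPU inference with `AutoGPTQForCausalLM.from_quantized(out_dir, device="cuda:0")`.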
For CPU inference, the base model is quantized to GGUF format using llama.cpp at 4-bit q4_k_m precision (medium size, balanced quality).
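The GGUF conversion follows llama.cpp's usual two-step flow, sketched below. The script and binary names match recent llama.cpp checkouts (older versions ship `convert.py` and `quantize` instead), and the paths are illustrative:

```bash
# 1. Convert the Hugging Face checkpoint to a full-precision GGUF file.
python convert_hf_to_gguf.py ./starcoderbase-3b \
    --outtype f16 --outfile starcoderbase-3b-f16.gguf

# 2. Quantize it to 4-bit q4_k_m (medium size, balanced quality).
./llama-quantize starcoderbase-3b-f16.gguf starcoderbase-3b-q4_k_m.gguf Q4_K_M
```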
## Benchmark for starcoderbase-3b (Quantized and Non-Quantized)
The benchmarks are run using lm-evaluation-harness. Here is the benchmarking script.
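The linked script is the source of truth; as a rough sketch, an equivalent run through the harness's Python API (task names taken from the tables below, batch size illustrative) looks like this:

```python
import lm_eval

# Evaluate on the CodeXGLUE code-to-text group and the two
# BIG-bench code line description tasks reported below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigcode/starcoderbase-3b",
    tasks=[
        "codexglue_code2text",
        "bigbench_code_line_description_generate_until",
        "bigbench_code_line_description_multiple_choice",
    ],
    batch_size=8,
)
print(results["results"])
```

The same call covers a GPTQ checkpoint by pointing `model_args` at the quantized directory, assuming a transformers version with GPTQ loading support via optimum.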
### Baseline starcoderbase-3b model (non-quantized)
| Tasks                  | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|------------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text    | N/A     | none   | None   | smoothed_bleu_4 | 1.3519 | ± | 0.3067 |
| - code2text_go         | 1       | none   | None   | smoothed_bleu_4 | 1.5781 | ± | 0.3734 |
| - code2text_java       | 1       | none   | None   | smoothed_bleu_4 | 1.2778 | ± | 0.1991 |
| - code2text_javascript | 1       | none   | None   | smoothed_bleu_4 | 1.1443 | ± | 0.1181 |
| - code2text_php        | 1       | none   | None   | smoothed_bleu_4 | 0.5171 | ± | 0.5171 |
| - code2text_python     | 1       | none   | None   | smoothed_bleu_4 | 2.8338 | ± | 1.5323 |
| - code2text_ruby       | 3       | none   | None   | smoothed_bleu_4 | 0.7601 | ± | 0.7601 |
| Groups              | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|---------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text | N/A     | none   | None   | smoothed_bleu_4 | 1.3519 | ± | 0.3067 |
| Tasks                                         | Version | Filter | n-shot | Metric      | Value |   | Stderr |
|-----------------------------------------------|--------:|--------|-------:|-------------|------:|---|-------:|
| bigbench_code_line_description_generate_until | 1       | none   | None   | exact_match |     0 | ± |      0 |
| Tasks                                          | Version | Filter | n-shot | Metric | Value |   | Stderr |
|------------------------------------------------|--------:|--------|-------:|--------|------:|---|-------:|
| bigbench_code_line_description_multiple_choice | 0       | none   | None   | acc    |  0.25 | ± | 0.0564 |
### starcoderbase-3b model quantized to GPTQ format
| Tasks                  | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|------------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text    | N/A     | none   | None   | smoothed_bleu_4 | 0.9254 | ± | 0.2109 |
| - code2text_go         | 1       | none   | None   | smoothed_bleu_4 | 1.4702 | ± | 0.4813 |
| - code2text_java       | 1       | none   | None   | smoothed_bleu_4 | 0.6907 | ± | 0.6907 |
| - code2text_javascript | 1       | none   | None   | smoothed_bleu_4 | 0.9469 | ± | 0.0339 |
| - code2text_php        | 1       | none   | None   | smoothed_bleu_4 | 0.5171 | ± | 0.5171 |
| - code2text_python     | 1       | none   | None   | smoothed_bleu_4 | 1.1676 | ± | 0.2156 |
| - code2text_ruby       | 3       | none   | None   | smoothed_bleu_4 | 0.7601 | ± | 0.7601 |
| Groups              | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|---------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text | N/A     | none   | None   | smoothed_bleu_4 | 0.9254 | ± | 0.2109 |
| Tasks                                         | Version | Filter | n-shot | Metric      | Value |   | Stderr |
|-----------------------------------------------|--------:|--------|-------:|-------------|------:|---|-------:|
| bigbench_code_line_description_generate_until | 1       | none   | None   | exact_match |     0 | ± |      0 |
| Tasks                                          | Version | Filter | n-shot | Metric | Value |   | Stderr |
|------------------------------------------------|--------:|--------|-------:|--------|------:|---|-------:|
| bigbench_code_line_description_multiple_choice | 0       | none   | None   | acc    |   0.1 | ± |    0.1 |
## Benchmark for starcoderbase-1b (Quantized and Non-Quantized)
As above, the benchmarks are run using lm-evaluation-harness with the same benchmarking script.
### Baseline starcoderbase-1b model (non-quantized)
| Tasks                  | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|------------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text    | N/A     | none   | None   | smoothed_bleu_4 | 0.8767 | ± | 0.0592 |
| - code2text_go         | 1       | none   | None   | smoothed_bleu_4 | 1.0054 | ± | 0.0983 |
| - code2text_java       | 1       | none   | None   | smoothed_bleu_4 | 1.2158 | ± | 0.1657 |
| - code2text_javascript | 1       | none   | None   | smoothed_bleu_4 | 0.8560 | ± | 0.0429 |
| - code2text_php        | 1       | none   | None   | smoothed_bleu_4 | 0.9879 | ± | 0.0887 |
| - code2text_python     | 1       | none   | None   | smoothed_bleu_4 | 1.1950 | ± | 0.2819 |
| - code2text_ruby       | 3       | none   | None   | smoothed_bleu_4 | 0.0000 | ± | 0.0000 |
| Groups              | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|---------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text | N/A     | none   | None   | smoothed_bleu_4 | 0.8767 | ± | 0.0592 |
| Tasks                                         | Version | Filter | n-shot | Metric      | Value |   | Stderr |
|-----------------------------------------------|--------:|--------|-------:|-------------|------:|---|-------:|
| bigbench_code_line_description_generate_until | 1       | none   | None   | exact_match |     0 | ± |      0 |
| Tasks                                          | Version | Filter | n-shot | Metric | Value |   | Stderr |
|------------------------------------------------|--------:|--------|-------:|--------|------:|---|-------:|
| bigbench_code_line_description_multiple_choice | 0       | none   | None   | acc    |  0.15 | ± | 0.0465 |
### starcoderbase-1b model quantized to GPTQ format
| Tasks                  | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|------------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text    | N/A     | none   | None   | smoothed_bleu_4 | 0.7959 | ± | 0.2180 |
| - code2text_go         | 1       | none   | None   | smoothed_bleu_4 | 0.9280 | ± | 0.0291 |
| - code2text_java       | 1       | none   | None   | smoothed_bleu_4 | 1.2112 | ± | 0.1703 |
| - code2text_javascript | 1       | none   | None   | smoothed_bleu_4 | 0.8848 | ± | 0.0391 |
| - code2text_php        | 1       | none   | None   | smoothed_bleu_4 | 0.6055 | ± | 0.6055 |
| - code2text_python     | 1       | none   | None   | smoothed_bleu_4 | 1.1460 | ± | 1.1460 |
| - code2text_ruby       | 3       | none   | None   | smoothed_bleu_4 | 0.0000 | ± | 0.0000 |
| Groups              | Version | Filter | n-shot | Metric          |  Value |   | Stderr |
|---------------------|--------:|--------|-------:|-----------------|-------:|---|-------:|
| codexglue_code2text | N/A     | none   | None   | smoothed_bleu_4 | 0.7959 | ± | 0.2180 |
| Tasks                                         | Version | Filter | n-shot | Metric      | Value |   | Stderr |
|-----------------------------------------------|--------:|--------|-------:|-------------|------:|---|-------:|
| bigbench_code_line_description_generate_until | 1       | none   | None   | exact_match |     0 | ± |      0 |
| Tasks                                          | Version | Filter | n-shot | Metric | Value  |   | Stderr |
|------------------------------------------------|--------:|--------|-------:|--------|-------:|---|-------:|
| bigbench_code_line_description_multiple_choice | 0       | none   | None   | acc    | 0.1333 | ± | 0.0443 |
## Challenges and Adapted Solutions
While benchmarking with lm-evaluation-harness, I encountered an issue, which I have raised here in detail. I fixed this issue with an MR.
While researching, I also tried bigcode-evaluation-harness for benchmarking, but I ran into a usage issue, which I have discussed here at length.
Since I have been working on Colab with one T4 GPU and on Kaggle kernels with two T4 GPUs, limited compute was a major constraint.
While researching and implementing, I tried a few approaches that did not make it into the final implementation.