
# LLM Quantization and Benchmarking

## Selected Model

The selected base models are bigcode/starcoderbase-3b and bigcode/starcoderbase-1b.

### Quantized to GPTQ format

The base model is quantized to GPTQ format with AutoGPTQ for GPU inference, at 4-bit precision (medium size, balanced quality).
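
For reference, a minimal sketch of what this step can look like with AutoGPTQ. The group size, calibration sample, and output path are illustrative assumptions, not taken from the actual script.

```python
# Hedged sketch of 4-bit GPTQ quantization with AutoGPTQ. The group size and
# one-line calibration sample are illustrative assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "bigcode/starcoderbase-3b"

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit precision, as described above
    group_size=128,  # common default; not taken from the actual script
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ is calibration-based: quantize() needs a small set of tokenized
# examples. A real run would use representative code; this is a placeholder.
examples = [tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")]
model.quantize(examples)

model.save_quantized("starcoderbase-3b-gptq-4bit", use_safetensors=True)
```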

### Quantized to GGUF format

The base model is quantized to GGUF format with llama.cpp for CPU inference, at 4-bit (q4_k_m) precision (medium size, balanced quality).
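
A minimal sketch of the GGUF path, driven from Python via subprocess. The llama.cpp script and binary names (convert-hf-to-gguf.py, llama-quantize) vary between llama.cpp versions, and all local paths are assumptions.

```python
# Hedged sketch: HF checkpoint -> f16 GGUF -> 4-bit q4_k_m GGUF via llama.cpp.
# Script/binary names and paths depend on the llama.cpp version and local
# layout; both are assumptions here.
import subprocess

hf_dir = "starcoderbase-3b"             # local HF checkpoint (assumed path)
f16_gguf = "starcoderbase-3b-f16.gguf"
q4_gguf = "starcoderbase-3b-q4_k_m.gguf"

# 1) Convert the Hugging Face model to an f16 GGUF file.
subprocess.run(
    ["python", "llama.cpp/convert-hf-to-gguf.py", hf_dir, "--outfile", f16_gguf],
    check=True,
)

# 2) Quantize the f16 GGUF down to 4-bit q4_k_m (medium size, balanced quality).
subprocess.run(
    ["llama.cpp/llama-quantize", f16_gguf, q4_gguf, "q4_k_m"],
    check=True,
)
```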

## Benchmark for starcoderbase-3b (Quantized and Non-Quantized)

The benchmarks are run with lm-evaluation-harness.

Here is the benchmarking script.
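
For illustration, a hedged sketch of an equivalent run through the lm-evaluation-harness Python API; the linked script may differ in details such as dtype and batch size, which are assumptions here. The task names match those in the result tables below.

```python
# Hedged sketch of the evaluation call via the lm-evaluation-harness Python
# API; dtype and batch size are assumptions, not taken from the actual script.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigcode/starcoderbase-3b,dtype=float16",
    tasks=[
        "codexglue_code2text",
        "bigbench_code_line_description_generate_until",
        "bigbench_code_line_description_multiple_choice",
    ],
    batch_size=8,  # assumption
)
print(results["results"])
```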

### Baseline starcoderbase-3b model (non-quantized)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 1.3519 | ± 0.3067 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 1.5781 | ± 0.3734 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 1.2778 | ± 0.1991 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 1.1443 | ± 0.1181 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.5171 | ± 0.5171 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 2.8338 | ± 1.5323 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.7601 | ± 0.7601 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 1.3519 | ± 0.3067 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.25 | ± 0.0564 |

### Quantized starcoderbase-3b model to GPTQ format
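
The quantized checkpoint can be handed to the harness through the `autogptq` entry in `model_args` (available in the 0.4.x harness releases); the local path below is an assumption that matches the save directory sketched earlier.

```python
# Hedged sketch: pointing lm-evaluation-harness at the GPTQ checkpoint via the
# `autogptq` model_args flag. The path and batch size are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=starcoderbase-3b-gptq-4bit,autogptq=True",
    tasks=["codexglue_code2text"],
    batch_size=8,  # assumption
)
print(results["results"])
```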

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.9254 | ± 0.2109 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 1.4702 | ± 0.4813 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 0.6907 | ± 0.6907 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 0.9469 | ± 0.0339 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.5171 | ± 0.5171 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 1.1676 | ± 0.2156 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.7601 | ± 0.7601 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.9254 | ± 0.2109 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.1 | ± 0.1 |

## Benchmark for starcoderbase-1b (Quantized and Non-Quantized)

The benchmarks are run with lm-evaluation-harness.

Here is the benchmarking script.

### Baseline starcoderbase-1b model (non-quantized)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.8767 | ± 0.0592 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 1.0054 | ± 0.0983 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 1.2158 | ± 0.1657 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 0.8560 | ± 0.0429 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.9879 | ± 0.0887 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 1.1950 | ± 0.2819 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.0000 | ± 0.0000 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.8767 | ± 0.0592 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.15 | ± 0.0465 |

### Quantized starcoderbase-1b model to GPTQ format

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.7959 | ± 0.2180 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 0.9280 | ± 0.0291 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 1.2112 | ± 0.1703 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 0.8848 | ± 0.0391 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.6055 | ± 0.6055 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 1.1460 | ± 1.1460 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.0000 | ± 0.0000 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.7959 | ± 0.2180 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 |

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.1333 | ± 0.0443 |

## Challenges and Adapted Solutions

- While benchmarking with lm-evaluation-harness, I encountered an issue, which I have raised here in detail. I fixed the issue with an MR.
- While researching, I also tried bigcode-evaluation-harness for benchmarking, but I ran into a usage issue, which I have discussed here at length.
- Since I have been working on Colab with one T4 GPU and on a Kaggle kernel with two T4 GPUs, limited compute was a major constraint.

## Some notable attempts

While researching and implementing, I tried a few approaches that did not make it into the final implementation.