Hi, thank you all for the very nice work. I would like to use this project to reproduce the evaluation results reported for Qwen 2.5 and Llama 3.1, but I have run into several problems:
the performance of Llama-3.1-8B-Instruct on gsm8k_cot_zeroshot
the performance of Qwen2.5-14B-Instruct on gsm8k_cot_zeroshot
Both scores are much lower than the originally reported numbers, especially for Llama-3.1-8B-Instruct.
I run the evaluation directly with a command like:
lm_eval --model hf --tasks gsm8k_cot_zeroshot --model_args pretrained=/pathto/Qwen2.5-14B-Instruct,parallelize=True --batch_size 64 --output_path ./results --log_samples
Is there anything wrong with this setup? Should I change the dtype, batch_size, chat template, max_tokens, or other settings?
Besides that, when I use this command to run gsm8k, the program gets stuck after loading the model and never starts the evaluation.
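A common cause of lower-than-reported scores for instruct-tuned models in lm-evaluation-harness is evaluating without the model's chat template. As a sketch (flag names follow recent harness releases; verify them against `lm_eval --help` for your installed version), a variant of the command above that applies the chat template and pins the dtype might look like:

```shell
# Hedged variant of the original command, not a confirmed fix:
# --apply_chat_template wraps prompts in the model's chat format,
# and dtype=bfloat16 pins precision instead of relying on the default.
lm_eval --model hf \
  --tasks gsm8k_cot_zeroshot \
  --model_args pretrained=/pathto/Qwen2.5-14B-Instruct,parallelize=True,dtype=bfloat16 \
  --apply_chat_template \
  --batch_size 64 \
  --output_path ./results \
  --log_samples
```

Whether this closes the gap depends on how the original Qwen 2.5 and Llama 3.1 reports formatted their prompts, which the papers' evaluation settings would need to confirm.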