Which version to use #2322
Hi! I think the leaderboard tasks were only added in 0.4.4? Also I believe the leaderboard uses this fork to run their evaluations: https://github.com/huggingface/lm-evaluation-harness
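One way to check whether the leaderboard task group is available in a given install is to list the registered tasks. A minimal sketch, assuming the 0.4.x `TaskManager` API and that the group name starts with `leaderboard`:

```python
# List registered task names and check for the "leaderboard" group, which is
# only present in newer harness releases. TaskManager / all_tasks are assumed
# to match the 0.4.x API.
from lm_eval.tasks import TaskManager

tm = TaskManager()
leaderboard_tasks = [t for t in tm.all_tasks if t.startswith("leaderboard")]
print(leaderboard_tasks or "no leaderboard tasks registered in this install")
```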
Thank you. I checked the leaderboard; they use 0.4.2 as of today. The link you shared seems to be from version 0.4.3.
Thank you for your response. I have actually reviewed their documentation. The results in the first table reflect the outcomes based on the installation guide provided in their docs. The second table corresponds to the repository link they mention in their documentation, which is the same one you recommended to me. The third table shows results for version 0.4.4. It seems that the leaderboard typically uses version 0.4.2, while the repository you recommended corresponds to version 0.4.3. Results from 0.4.3 are generally closer to those from 0.4.2, but depending on the task, performance can still vary significantly. As for version 0.4.4, which is the latest on this repo, the results differ considerably from the previous versions. That's why I've provided the results for each version, to highlight these differences. In some cases, models perform better in the newer version, but in others, they perform worse.

My main problem is that I hit an issue when trying to use version 0.4.2 from code, where what I pass to the harness is not a model-name string, e.g. `lm_eval_model = HFLM(model, device=device, ...)`.
I get an error when doing that. If I can fix that problem, I think I will keep using version 0.4.2 so that I can benchmark my model against the leaderboard models. (A rough sketch of the usage I mean is below.)
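For context, here is a minimal sketch of that kind of non-string-model usage, based on the 0.4.x Python API; the model id, device, and task name are placeholders, and 0.4.2 may reject a pre-loaded model here, which is exactly the error being discussed:

```python
# Sketch: wrap an already-instantiated transformers model in HFLM instead of
# passing a model-name string. Model id, device, and task are placeholders.
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Recent 0.4.x releases accept a model object for `pretrained`;
# 0.4.2 may fail here, which is the problem described in this thread.
lm_eval_model = HFLM(pretrained=model, tokenizer=tokenizer, device="cuda", batch_size=1)

results = lm_eval.simple_evaluate(model=lm_eval_model, tasks=["gsm8k"])  # placeholder task
print(results["results"])
```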
The experiments above were conducted using the command-line interface.
I'll look into this. For the error in 0.4.2, I think if you remove this condition and just return |
Thank you for your help. |
Hi! I made some runs on `main` and on the HF fork.
Aha, thanks. Here is my command.
Could you explain why the performance of the same model changes significantly depending on the version of lm_eval?
For example, with llama3-1-8b-instruct and batch_size=1:
- v0.4.2 (the leaderboard version)
- v0.4.3
- v0.4.4
I'm unsure which version to use, as my model performs well on version 0.4.4 but is outperformed by the base LLaMA 3.1-8B on version 0.4.2.
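For what it's worth, one way to make the cross-version comparison as controlled as possible is to pin each harness version in its own environment and run the identical evaluation, recording the installed version next to the scores. A sketch under those assumptions (the model id and task group are placeholders, and the `leaderboard` group only exists in newer releases):

```python
# Sketch: run the same evaluation under a pinned harness version (e.g. install
# lm_eval==0.4.2 or ==0.4.4 in separate virtualenvs) and record the version
# alongside the scores. Model id and task group are placeholders.
from importlib.metadata import version

import lm_eval
from lm_eval.models.huggingface import HFLM

lm = HFLM(pretrained="meta-llama/Meta-Llama-3.1-8B-Instruct", batch_size=1)
out = lm_eval.simple_evaluate(model=lm, tasks=["leaderboard"])

print("harness version:", version("lm_eval"))
print(out["results"])
```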