Disclaimer
It's not a Lighteval bug but an lm-evaluation-harness one. I'm just letting you know so you can investigate the batch size issue.
Describe the bug
I hit the bug when setting batch_size = auto. It originally occurred with the glaiveai/Reflection-Llama-3.1-70B evaluation, but it has now recurred with rombodawg/Rombos-LLM-V2.5-Qwen-32b (I ran a manual evaluation of this model to check for another bug).
The error occurs when batch_size = auto:1 is passed to the evaluation process, which then hangs. The longest hang I observed was two days with no "vital signs" in the log file, even though according to htop and nvidia-smi the process was still running.
The issue can be worked around by removing the --batch_size auto flag from the Slurm file, which sets batch_size=1 for the whole evaluation process.
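For context, a minimal sketch of what that change looks like in the Slurm launch script, assuming the standard lm_eval entry point; the Slurm directives, task, model args, and output path below are illustrative rather than the exact leaderboard configuration:

```bash
#!/bin/bash
#SBATCH --job-name=lm-eval
#SBATCH --gres=gpu:8

# Before (can hang when the automatic probe settles on auto:1):
#   --batch_size auto
# After: pin the batch size explicitly.
accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=rombodawg/Rombos-LLM-V2.5-Qwen-32b \
    --tasks hellaswag \
    --batch_size 1 \
    --output_path ./results
```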
To Reproduce
I use lm-evaluation-harness on the adding_all_changess branch. My accelerate setup is this:
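(Aside: if the exact accelerate configuration is needed for reproduction, it can be dumped with the standard tooling sketched below; this is an illustrative sketch, not the configuration used in the runs above, and the output depends on the local machine.)

```bash
# Print the accelerate version, default config, and environment details
accelerate env

# GPU inventory for context (model, memory, driver per card)
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
```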
Expected behavior
I expect batch_size=auto not to block the evaluation process when it is automatically resolved to 1.
p.s. Wow, what a cool automatic bug report template!
p.p.s. Please ping me if something is unclear or if I need to change the type of this issue; I decided to describe it as a bug.