Disclaimer
It's not a Lighteval bug but an lm-evaluation-harness one. I'm just letting you know so you can investigate the batch size issue.
Describe the bug
I hit the bug when setting batch_size = auto. It originally occurred with the glaiveai/Reflection-Llama-3.1-70B evaluation, but it has now recurred with rombodawg/Rombos-LLM-V2.5-Qwen-32b (I ran a manual evaluation of this model to check for another bug).
The error occurs when batch_size = auto:1 is passed to the evaluation process, which then hangs. The longest hang I observed was two days with no "vital signs" in the log file, even though according to htop and nvidia-smi the process was still running.
The issue can be worked around by removing the --batch_size auto flag from the Slurm file, which sets batch_size=1 for the whole evaluation process.
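For context, a minimal sketch of what that change looks like in the Slurm launch script, assuming the standard lm_eval entry point; the Slurm directives, task, model args, and output path below are illustrative rather than the exact leaderboard configuration:

```bash
#!/bin/bash
#SBATCH --job-name=lm-eval
#SBATCH --gres=gpu:8

# Before (can hang when the automatic probe settles on auto:1):
#   --batch_size auto
# After: pin the batch size explicitly.
accelerate launch -m lm_eval \
    --model hf \
    --model_args pretrained=rombodawg/Rombos-LLM-V2.5-Qwen-32b \
    --tasks hellaswag \
    --batch_size 1 \
    --output_path ./results
```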
To Reproduce
I use lm-evaluation-harness on the adding_all_changess branch. My accelerate setup is this:
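(Aside: if the exact accelerate configuration is needed for reproduction, it can be dumped with the standard tooling sketched below; this is an illustrative sketch, not the configuration used in the runs above, and the output depends on the local machine.)

```bash
# Print the accelerate version, default config, and environment details
accelerate env

# GPU inventory for context (model, memory, driver per card)
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
```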
Expected behavior
I expect batch_size=auto not to block the evaluation process when it is automatically resolved to 1.
p.s. Wow, what a cool automatic bug report template!
p.p.s. Please ping me if something is unclear or if I need to change the type of this issue; I decided to describe it as a bug.