Training with streaming hangs during eval #2455

1215thebqtic · 2024-11-15T02:27:36Z

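When training on a fairly large dataset with streaming enabled, training hangs at evaluation and never continues. Turning streaming off and using lazy tokenize instead is tens of times slower, so that is not usable either. For now, setting the eval strategy to no lets training run normally. Could you take a look when you have time? Thanks!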

Below is the training script that hangs. eval_steps defaults to 50, so it hangs at step 50. If left running long enough, a GPU OOM may also occur.
swift sft \
    --model_id_or_path qwen/Qwen2-1.5B-Instruct \
    --use_flash_attn False \
    --num_train_epochs 5 \
    --batch_size 2 \
    --save_total_limit -1 \
    --sft_type lora \
    --dtype fp32 \
    --lazy_tokenize False \
    --streaming True \
    --preprocess_num_proc 8 \
    --gradient_accumulation_steps 48 \
    --max_steps 10000 \
    --max_length 512 \
    --truncation_strategy delete
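
For anyone trying to reproduce this, here is a rough sketch of how much data goes through before the first eval fires. It only uses the numbers from the command above plus the default eval_steps=50 mentioned in the report. It assumes that batch_size is per device and that a "step" is one optimizer update (after gradient accumulation). Neither assumption is confirmed in this thread, and the variable names are just for illustration.

```python
# Values copied from the command above; eval_steps=50 is the default cited in the report.
batch_size = 2                      # assumed to be per device
gradient_accumulation_steps = 48
eval_steps = 50
max_steps = 10000

# Assumption: one "step" is one optimizer update, i.e. after accumulation.
samples_per_step = batch_size * gradient_accumulation_steps
samples_before_first_eval = samples_per_step * eval_steps
eval_rounds = max_steps // eval_steps

print(f"samples per optimizer step (per device): {samples_per_step}")
print(f"samples consumed before first eval (per device): {samples_before_first_eval}")
print(f"eval rounds over the whole run: {eval_rounds}")
```

Under those assumptions the hang shows up after roughly 4800 samples per device, so a much smaller eval_steps or gradient_accumulation_steps should hit the same eval path sooner when debugging.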

tastelikefeet added the bug Something isn't working label Nov 18, 2024
