Hi,
First, thanks for your contribution. I have a system with 2 GPUs (RTX 3080 Ti, 12 GB memory). When I run train.sh, I get a CUDA error. I reduced the batch size but got the same error again. How can I resolve this?
NOTE: There is no system or tooling problem; I have trained other models on this same system and conda env.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda error
./run_train.sh: line 37: 146957 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0,1 python run.py --model_name_or_path facebook/bart-large --do_train --do_eval --do_predict --train_file data/elife/train.json --validation_file data/elife/validation.json --test_file data/elife/test.json --output_dir outputs/train --per_device_train_batch_size 4 --gradient_accumulation_steps 2 --per_device_eval_batch_size 4 --num_train_epochs 10 --learning_rate 3e-5 --warmup_steps 1500 --weight_decay 0.01 --max_grad_norm 0.1 --metric_for_best_model rougeLsum --evaluation_strategy epoch --save_strategy epoch --fp16 false --bosent_token_id 50264 --encoder_loss_ratio 1.0 --encoder_label_smoothing 0.1 --encoder_label_smoothing_type adjacent --lower_saliency_threshold 0.125 --higher_saliency_threshold 0.230 --marginal_distribution true --marginal_temperature 0.5 --num_beams 5 --max_length 256 --min_length 20 --length_penalty 1.5 --no_repeat_ngram_size 3 --overwrite_output_dir --predict_with_generate