CUDA ERROR #5

enesdoruk · 2024-05-23T10:34:16Z

Hi,
Firstly, thanks for your contribution. I have a system with 2 RTX 3080 Ti GPUs (12 GB of memory each). When I run train.sh, I get a CUDA error. I reduced the batch size but got the same error again. How can I resolve this?

NOTE: There is no system or tooling problem; I train other models on this same system and conda env.

terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda error
./run_train.sh: line 37: 146957 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0,1 python run.py --model_name_or_path facebook/bart-large --do_train --do_eval --do_predict --train_file data/elife/train.json --validation_file data/elife/validation.json --test_file data/elife/test.json --output_dir outputs/train --per_device_train_batch_size 4 --gradient_accumulation_steps 2 --per_device_eval_batch_size 4 --num_train_epochs 10 --learning_rate 3e-5 --warmup_steps 1500 --weight_decay 0.01 --max_grad_norm 0.1 --metric_for_best_model rougeLsum --evaluation_strategy epoch --save_strategy epoch --fp16 false --bosent_token_id 50264 --encoder_loss_ratio 1.0 --encoder_label_smoothing 0.1 --encoder_label_smoothing_type adjacent --lower_saliency_threshold 0.125 --higher_saliency_threshold 0.230 --marginal_distribution true --marginal_temperature 0.5 --num_beams 5 --max_length 256 --min_length 20 --length_penalty 1.5 --no_repeat_ngram_size 3 --overwrite_output_dir --predict_with_generate

enesdoruk · 2024-05-23T10:58:33Z

This command works:
"
CUDA_VISIBLE_DEVICES=0 python run.py \
--model_name_or_path facebook/bart-large \
--do_train \
--do_eval \
--do_predict \
--train_file data/elife/train.json \
--validation_file data/elife/validation.json \
--test_file data/elife/test.json \
--output_dir outputs/train \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--per_device_eval_batch_size 1 \
--num_train_epochs 10 \
--learning_rate 3e-5 \
--warmup_steps 1500 \
--weight_decay 0.01 \
--max_grad_norm 0.01 \
--metric_for_best_model rougeLsum \
--evaluation_strategy epoch \
--save_strategy epoch \
--fp16 false \
--bosent_token_id 50264 \
--encoder_loss_ratio 1.0 \
--encoder_label_smoothing 0.1 \
--encoder_label_smoothing_type adjacent \
--lower_saliency_threshold 0.125 \
--higher_saliency_threshold 0.230 \
--marginal_distribution true \
--marginal_temperature 0.5 \
--num_beams 5 \
--max_length 256 \
--min_length 20 \
--length_penalty 1.5 \
--no_repeat_ngram_size 3 \
--overwrite_output_dir \
--predict_with_generate"

When I set fp16 to false, it works.

enesdoruk · 2024-05-23T11:04:01Z

It also works when I use a single GPU, but it does not work on 2 GPUs.
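
Since the failure only appears with 2 GPUs and the message is "NCCL Error 1: unhandled cuda error", one possible next step is to rerun with NCCL's own logging enabled so the underlying CUDA error is printed. The sketch below is only a diagnostic aid, not a confirmed fix. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables; the helper name `nccl_debug_env` is hypothetical, and the idea that peer-to-peer transfers between the two cards are involved is an assumption that nothing in this thread confirms.

```python
import os


def nccl_debug_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with NCCL diagnostics switched on.

    Hypothetical helper for illustration; it does not launch anything itself.
    """
    # Copy so the caller's (or the process's) environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Same two devices as in the failing command from the log.
    env["CUDA_VISIBLE_DEVICES"] = "0,1"
    # Ask NCCL to log initialization details and the CUDA error behind the abort.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Assumption: direct GPU-to-GPU transfers might be the failing path.
        # Disabling them is an experiment, not a known solution for this issue.
        env["NCCL_P2P_DISABLE"] = "1"
    return env
```

The returned mapping could be passed to the training script, for example `subprocess.run(["bash", "run_train.sh"], env=nccl_debug_env())`, first with the default and then with `disable_p2p=True`, comparing the NCCL log output of the two runs.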
