CUDA ERROR #5

Open
enesdoruk opened this issue May 23, 2024 · 2 comments
enesdoruk commented May 23, 2024

Hi,
Firstly, thanks for your contribution. I have a 2-GPU system with RTX 3080 Ti cards (12 GB RAM). When I run train.sh I get a CUDA error. I reduced the batch size but got the same error again. How can I resolve this?

NOTE: There is no system or tooling problem; I train different models on this same system and conda env.

terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda error
./run_train.sh: line 37: 146957 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0,1 python run.py --model_name_or_path facebook/bart-large --do_train --do_eval --do_predict --train_file data/elife/train.json --validation_file data/elife/validation.json --test_file data/elife/test.json --output_dir outputs/train --per_device_train_batch_size 4 --gradient_accumulation_steps 2 --per_device_eval_batch_size 4 --num_train_epochs 10 --learning_rate 3e-5 --warmup_steps 1500 --weight_decay 0.01 --max_grad_norm 0.1 --metric_for_best_model rougeLsum --evaluation_strategy epoch --save_strategy epoch --fp16 false --bosent_token_id 50264 --encoder_loss_ratio 1.0 --encoder_label_smoothing 0.1 --encoder_label_smoothing_type adjacent --lower_saliency_threshold 0.125 --higher_saliency_threshold 0.230 --marginal_distribution true --marginal_temperature 0.5 --num_beams 5 --max_length 256 --min_length 20 --length_penalty 1.5 --no_repeat_ngram_size 3 --overwrite_output_dir --predict_with_generate

@enesdoruk
Author

This command works:

CUDA_VISIBLE_DEVICES=0 python run.py \
    --model_name_or_path facebook/bart-large \
    --do_train \
    --do_eval \
    --do_predict \
    --train_file data/elife/train.json \
    --validation_file data/elife/validation.json \
    --test_file data/elife/test.json \
    --output_dir outputs/train \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --per_device_eval_batch_size 1 \
    --num_train_epochs 10 \
    --learning_rate 3e-5 \
    --warmup_steps 1500 \
    --weight_decay 0.01 \
    --max_grad_norm 0.01 \
    --metric_for_best_model rougeLsum \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --fp16 false \
    --bosent_token_id 50264 \
    --encoder_loss_ratio 1.0 \
    --encoder_label_smoothing 0.1 \
    --encoder_label_smoothing_type adjacent \
    --lower_saliency_threshold 0.125 \
    --higher_saliency_threshold 0.230 \
    --marginal_distribution true \
    --marginal_temperature 0.5 \
    --num_beams 5 \
    --max_length 256 \
    --min_length 20 \
    --length_penalty 1.5 \
    --no_repeat_ngram_size 3 \
    --overwrite_output_dir \
    --predict_with_generate

It works when I set fp16 to false.

@enesdoruk
Author

And when I use a single GPU it works, but it does not work on 2 GPUs.
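
Since the run only fails with two GPUs and the error comes from NCCL, one possible next step is to rerun with NCCL's own logging turned on. The thread does not identify the cause, so this is only a diagnostic sketch. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables; the helper names `nccl_debug_env` and `run_with_nccl_diagnostics` are hypothetical, and disabling peer-to-peer transfers is an assumption to test, not a confirmed fix.

```python
import os
import subprocess


def nccl_debug_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with NCCL diagnostics enabled.

    base_env: mapping to copy; defaults to the current process environment.
    disable_p2p: if True, also set NCCL_P2P_DISABLE=1 to rule out
                 GPU peer-to-peer transfers (an assumption to test).
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make NCCL print initialization and error details to stderr.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_with_nccl_diagnostics(cmd, disable_p2p=False):
    """Run a training command (a list of strings) with NCCL logging on."""
    env = nccl_debug_env(disable_p2p=disable_p2p)
    # check=False: the run is expected to fail; we want its log output.
    return subprocess.run(cmd, env=env, check=False)
```

For example, `run_with_nccl_diagnostics(["./run_train.sh"])` would rerun the failing script with verbose NCCL output, and a second run with `disable_p2p=True` would show whether the error depends on peer-to-peer communication between the two cards.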
