loss becomes nan during training #517

Open

Davidwhw opened this issue Jun 13, 2024 · 0 comments

Davidwhw commented Jun 13, 2024

When I ran stage-1 pre-training on the COCO dataset (downloaded with the script from LAVIS), the loss quickly became NaN. I traced the problem to my mistake of using Vicuna-7B-v1.5 as the LLM instead of the default Vicuna v0 7B.
I wonder why a different Vicuna version causes the loss to become NaN. Could the same problem arise with Llama-3?
Does anyone know the likely cause?
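
If I understand correctly, Vicuna-7B-v1.5 is based on Llama-2 while the default Vicuna v0 is based on the original LLaMA, so my current guess is fp16 overflow somewhere in the forward pass. Below is a minimal sketch of what I plan to try: loading the LLM weights in bf16 instead of fp16. This is not the repo's actual loading code; the transformers call and the checkpoint name are just my assumptions for illustration.

import torch
from transformers import AutoModelForCausalLM

# Load the LLM in bf16 instead of fp16: bf16 keeps the fp32 exponent range,
# so activations that would overflow in fp16 remain finite.
llm = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",       # example checkpoint name, not necessarily the path in my config
    torch_dtype=torch.bfloat16,
)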

Here is the training log:

2024-06-12 16:20:56,611 [INFO] Start training
2024-06-12 16:21:06,082 [INFO] dataset_ratios not specified, datasets will be concatenated (map-style datasets) or chained (webdataset.DataPipeline).
2024-06-12 16:21:06,082 [INFO] Loaded 414113 records for train split from the dataset.
batch sizes [[64]]
module.llama_proj.weight
module.llama_proj.bias
2024-06-12 16:21:06,100 [INFO] number of trainable parameters: 3149824
2024-06-12 16:21:06,101 [INFO] Start training epoch 0, 5000 iters per inner epoch.
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Train: data epoch: [0] [ 0/5000] eta: 6:21:59 lr: 0.000001 loss: 6.8750 time: 4.5839 data: 0.0000 max mem: 54099
Train: data epoch: [0] [ 50/5000] eta: 5:30:41 lr: 0.000002 loss: 6.9350 time: 4.3366 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 100/5000] eta: 5:36:02 lr: 0.000003 loss: nan time: 4.2825 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 150/5000] eta: 5:35:11 lr: 0.000004 loss: 6.9082 time: 4.2402 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 200/5000] eta: 5:32:37 lr: 0.000005 loss: 6.8793 time: 4.1392 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 250/5000] eta: 5:29:55 lr: 0.000006 loss: nan time: 4.0571 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 300/5000] eta: 5:19:55 lr: 0.000007 loss: nan time: 3.5617 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 350/5000] eta: 5:11:46 lr: 0.000008 loss: nan time: 3.4891 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 400/5000] eta: 5:03:33 lr: 0.000009 loss: nan time: 3.5205 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 450/5000] eta: 4:57:14 lr: 0.000010 loss: nan time: 3.5452 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 500/5000] eta: 4:51:30 lr: 0.000011 loss: nan time: 3.4910 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 550/5000] eta: 4:46:23 lr: 0.000012 loss: nan time: 3.4316 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 600/5000] eta: 4:41:38 lr: 0.000013 loss: nan time: 3.5642 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 650/5000] eta: 4:37:05 lr: 0.000014 loss: nan time: 3.5331 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 700/5000] eta: 4:33:05 lr: 0.000015 loss: nan time: 3.4861 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 750/5000] eta: 4:29:03 lr: 0.000016 loss: nan time: 3.4726 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 800/5000] eta: 4:25:02 lr: 0.000017 loss: nan time: 3.6616 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 850/5000] eta: 4:21:01 lr: 0.000018 loss: nan time: 3.5413 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 900/5000] eta: 4:17:19 lr: 0.000019 loss: nan time: 3.6708 data: 0.0000 max mem: 55634
Train: data epoch: [0] [ 950/5000] eta: 4:13:42 lr: 0.000020 loss: nan time: 3.6225 data: 0.0000 max mem: 55634
Train: data epoch: [0] [1000/5000] eta: 4:10:01 lr: 0.000021 loss: nan time: 3.5988 data: 0.0000 max mem: 55634
...
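
To narrow down where the NaNs first appear, I also plan to attach forward hooks that flag the first module producing non-finite outputs. This is a rough sketch using the standard PyTorch hook API; add_nan_hooks and model are my own names, not something from the repo.

import torch

def add_nan_hooks(model):
    # Register a forward hook on every submodule that prints the module's name
    # as soon as its output contains a NaN or Inf.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output detected in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))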

Can you give me some advice or clues?
Thank you for your assistance.
