
CTC/AED PROBLEM IN K2 #1618

Closed

zw76859420 opened this issue May 6, 2024 · 10 comments

Comments

@zw76859420

When we trained the model (#1389) with k2, the following problem occurred. We don't know how to solve it. Can anyone help?

[screenshot of the error log]

@marcoyang1998
Collaborator

It seems that you have nan values in the forward pass of your model. How many GPUs and what max_duration are you using?
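One generic way to narrow down a non-finite forward pass is to first confirm that the input features themselves are finite, so a corrupt utterance can be told apart from a diverging model. The sketch below is an assumption-laden illustration: it assumes the batch layout used in the icefall recipes, where batch["inputs"] holds the (N, T, C) fbank features, and assert_finite_features is a hypothetical helper, not part of the recipe.

import torch

# Hedged sanity check (not from this thread): verify that the fbank features
# fed to the model contain no nan/inf values before the forward pass.
# batch["inputs"] is assumed to be the (N, T, C) feature tensor.
def assert_finite_features(batch: dict) -> None:
    feats = batch["inputs"]
    mask = ~torch.isfinite(feats)                     # True where nan/inf
    if mask.any():
        bad = mask.flatten(1).any(dim=1).nonzero().flatten()
        raise ValueError(f"Non-finite features in batch elements {bad.tolist()}")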

@zw76859420
Author

Thanks for your kind reply. We used the following params to train the zipformer (CTC/AED) model.

./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 0 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 350 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 0 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.1 \
  --attention-decoder-loss-scale 0.9 \
  --num-encoder-layers 2,2,4,6,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192

@zw76859420
Author

We are using the following code to remove such utterances in zipformer/train.py:

[screenshots of the filtering code in train.py]
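The screenshots are not reproduced here. For reference, the icefall recipes typically drop problematic utterances with a duration-based filter on the training cuts; the sketch below is a minimal illustration of that idea, with bounds chosen for illustration rather than taken from the screenshots.

# Minimal sketch of duration-based filtering of training cuts, in the spirit
# of the remove_short_and_long_utt filter found in icefall's train.py; the
# 1.0 s / 20.0 s bounds here are illustrative only.
def remove_short_and_long_utt(c) -> bool:
    # Keep only cuts whose duration falls in a sane range; extremely short or
    # extremely long utterances are a common source of non-finite losses.
    return 1.0 <= c.duration <= 20.0

# train_cuts = train_cuts.filter(remove_short_and_long_utt)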

@danpovey
Collaborator

danpovey commented May 6, 2024 via email

@yaozengwei
Collaborator

He ran that with --inf-check True. The log shows "module.encoder_embed.conv.0.output is not finite." The total batch size he was using (--world-size 3 --max-duration 350) might be too small for a base-lr of 0.045.
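To make the point concrete: the amount of audio seen per optimizer step scales with world-size × max-duration, so a smaller total batch generally calls for a smaller base-lr. In the back-of-the-envelope sketch below, the "reference" configuration is an assumed large-batch setup for illustration, not figures taken from this thread.

# Rough comparison of effective batch sizes, in seconds of audio per
# optimizer step. The "reference" configuration is an assumption.
def effective_batch_seconds(world_size: int, max_duration: float) -> float:
    return world_size * max_duration

reference = effective_batch_seconds(world_size=8, max_duration=1000)  # assumed setup
this_run = effective_batch_seconds(world_size=3, max_duration=350)    # config above
print(f"reference: {reference:.0f} s/step, this run: {this_run:.0f} s/step, "
      f"ratio: {reference / this_run:.1f}x")
# With several times less audio per step, the default base-lr of 0.045 is
# more likely to be unstable, which is why lowering it was suggested.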

@zw76859420
Author

Thanks for the advice. We used the following model config, but the same problem still occurs.

export CUDA_VISIBLE_DEVICES="0,1,2"

./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 600 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 1 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.3 \
  --attention-decoder-loss-scale 0.7 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192

@zw76859420
Author

The detailed error log is as follows:

Traceback (most recent call last):
File "./zipformer/train.py", line 1520, in
main()
File "./zipformer/train.py", line 1511, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "ASR/zipformer/train.py", line 1318, in run
train_one_epoch(
File "ASR/zipformer/train.py", line 1009, in train_one_epoch
loss, loss_info = compute_loss(
File "ASR/zipformer/train.py", line 843, in compute_loss
simple_loss, pruned_loss, ctc_loss, attention_decoder_loss = model(
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "icefall/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
output = self.module(*inputs[0], **kwargs[0])
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "ASR/zipformer/model.py", line 338, in forward
encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
File "ASR/zipformer/model.py", line 140, in forward_encoder
x, x_lens = self.encoder_embed(x, x_lens)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "ASR/zipformer/subsampling.py", line 309, in forward
x = self.conv(x)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
hook_result = hook(self, input, result)
File "icefall/icefall/hooks.py", line 43, in forward_hook
raise ValueError(
ValueError: The sum of module.encoder_embed.conv.0.output is not finite: tensor([[[[nan, nan, nan, ..., nan, nan, nan],
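For context, the --inf-check option works by attaching forward hooks that test every module's output, which is what produced the error above. The sketch below is a simplified illustration in the spirit of icefall/icefall/hooks.py, not the exact code.

import torch
import torch.nn as nn

# Simplified sketch of an inf-check forward hook: after each module's forward
# pass, verify that the output sum is finite and fail with the module's name
# otherwise (matching the "The sum of ... is not finite" message above).
def register_inf_check_hooks(model: nn.Module) -> None:
    for name, module in model.named_modules():
        def forward_hook(_module, _input, output, _name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output.sum()):
                raise ValueError(f"The sum of {_name}.output is not finite: {output}")
        module.register_forward_hook(forward_hook)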

@danpovey
Collaborator

danpovey commented May 7, 2024

I meant the normal training log output leading up to that point, not the error traceback. (But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could try 0.03 for instance.)

@zw76859420
Author

Thanks Dan, my idol.
Zengwei also gave me the same suggestion, and we have now set the base-lr to 0.035, keeping the rest of the above config.
We will show you the training results as soon as possible.

@zw76859420
Author

I meant the normal training log output leading up to that point, not the error traceback. (But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could try 0.03 for instance.)

It works!!!
When I reduced the base-lr from 0.045 to 0.030, the loss started dropping quite normally.
[screenshot of the training loss]
