
CTC/AED PROBLEM IN K2 #1618

Closed

zw76859420 opened this issue May 6, 2024 · 10 comments

Comments

@zw76859420

When we trained the model (#1389) with k2, the following problem occurred. We don't know how to solve it. Can anyone help?

[screenshot of the error log]

@marcoyang1998
Collaborator

It seems that you have nan values in the forward pass of your model. How many GPUs and what max_duration are you using?
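One generic way to narrow down a non-finite forward pass is to first confirm that the input features themselves are finite, so a corrupt utterance can be told apart from a diverging model. The sketch below is an assumption-laden illustration: it assumes the batch layout used in the icefall recipes, where batch["inputs"] holds the (N, T, C) fbank features, and assert_finite_features is a hypothetical helper, not part of the recipe.

import torch

# Hedged sanity check (not from this thread): verify that the fbank features
# fed to the model contain no nan/inf values before the forward pass.
# batch["inputs"] is assumed to be the (N, T, C) feature tensor.
def assert_finite_features(batch: dict) -> None:
    feats = batch["inputs"]
    mask = ~torch.isfinite(feats)                     # True where nan/inf
    if mask.any():
        bad = mask.flatten(1).any(dim=1).nonzero().flatten()
        raise ValueError(f"Non-finite features in batch elements {bad.tolist()}")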

@zw76859420
Author

Thanks for your kind reply. We used the following params to train the zipformer (CTC/AED) model.

./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 0 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 350 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 0 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.1 \
  --attention-decoder-loss-scale 0.9 \
  --num-encoder-layers 2,2,4,6,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192

@zw76859420
Author

We are using the following code to remove such utterances in zipformer/train.py:

[screenshots of the filtering code in train.py]
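The screenshots are not reproduced here. For reference, the icefall recipes typically drop problematic utterances with a duration-based filter on the training cuts; the sketch below is a minimal illustration of that idea, with bounds chosen for illustration rather than taken from the screenshots.

# Minimal sketch of duration-based filtering of training cuts, in the spirit
# of the remove_short_and_long_utt filter found in icefall's train.py; the
# 1.0 s / 20.0 s bounds here are illustrative only.
def remove_short_and_long_utt(c) -> bool:
    # Keep only cuts whose duration falls in a sane range; extremely short or
    # extremely long utterances are a common source of non-finite losses.
    return 1.0 <= c.duration <= 20.0

# train_cuts = train_cuts.filter(remove_short_and_long_utt)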

@danpovey
Collaborator

danpovey commented May 6, 2024 via email

@yaozengwei
Collaborator

He ran that with --inf-check True. The log shows "module.encoder_embed.conv.0.output is not finite." The total batch size he was using (--world-size 3 --max-duration 350) might be too small for a base-lr of 0.045.
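To make the point concrete: the amount of audio seen per optimizer step scales with world-size × max-duration, so a smaller total batch generally calls for a smaller base-lr. In the back-of-the-envelope sketch below, the "reference" configuration is an assumed large-batch setup for illustration, not figures taken from this thread.

# Rough comparison of effective batch sizes, in seconds of audio per
# optimizer step. The "reference" configuration is an assumption.
def effective_batch_seconds(world_size: int, max_duration: float) -> float:
    return world_size * max_duration

reference = effective_batch_seconds(world_size=8, max_duration=1000)  # assumed setup
this_run = effective_batch_seconds(world_size=3, max_duration=350)    # config above
print(f"reference: {reference:.0f} s/step, this run: {this_run:.0f} s/step, "
      f"ratio: {reference / this_run:.1f}x")
# With several times less audio per step, the default base-lr of 0.045 is
# more likely to be unstable, which is why lowering it was suggested.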

@zw76859420
Author

Thanks for the advice. We used the following model config, but the same problem still occurs.

export CUDA_VISIBLE_DEVICES="0,1,2"

./zipformer/train.py \
  --world-size 3 \
  --num-epochs 360 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --base-lr 0.045 \
  --lr-epochs 1.5 \
  --max-duration 600 \
  --enable-musan 0 \
  --use-fp16 0 \
  --lang-dir data/lang_char \
  --manifest-dir data/fbank \
  --on-the-fly-feats 0 \
  --save-every-n 2000 \
  --keep-last-k 20 \
  --inf-check 1 \
  --use-transducer 1 \
  --use-ctc 1 \
  --use-attention-decoder 1 \
  --ctc-loss-scale 0.3 \
  --attention-decoder-loss-scale 0.7 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192

@zw76859420
Author

The detailed error log is as follows:

Traceback (most recent call last):
File "./zipformer/train.py", line 1520, in
main()
File "./zipformer/train.py", line 1511, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "icefall/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "ASR/zipformer/train.py", line 1318, in run
train_one_epoch(
File "ASR/zipformer/train.py", line 1009, in train_one_epoch
loss, loss_info = compute_loss(
File "ASR/zipformer/train.py", line 843, in compute_loss
simple_loss, pruned_loss, ctc_loss, attention_decoder_loss = model(
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "icefall/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
output = self.module(*inputs[0], **kwargs[0])
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "ASR/zipformer/model.py", line 338, in forward
encoder_out, encoder_out_lens = self.forward_encoder(x, x_lens)
File "ASR/zipformer/model.py", line 140, in forward_encoder
x, x_lens = self.encoder_embed(x, x_lens)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "ASR/zipformer/subsampling.py", line 309, in forward
x = self.conv(x)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "icefall/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
hook_result = hook(self, input, result)
File "icefall/icefall/hooks.py", line 43, in forward_hook
raise ValueError(
ValueError: The sum of module.encoder_embed.conv.0.output is not finite: tensor([[[[nan, nan, nan, ..., nan, nan, nan],
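For context, the --inf-check option works by attaching forward hooks that test every module's output, which is what produced the error above. The sketch below is a simplified illustration in the spirit of icefall/icefall/hooks.py, not the exact code.

import torch
import torch.nn as nn

# Simplified sketch of an inf-check forward hook: after each module's forward
# pass, verify that the output sum is finite and fail with the module's name
# otherwise (matching the "The sum of ... is not finite" message above).
def register_inf_check_hooks(model: nn.Module) -> None:
    for name, module in model.named_modules():
        def forward_hook(_module, _input, output, _name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output.sum()):
                raise ValueError(f"The sum of {_name}.output is not finite: {output}")
        module.register_forward_hook(forward_hook)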

@danpovey
Collaborator

danpovey commented May 7, 2024

I meant the normal training log output leading up to that point, not the error traceback. (But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could try 0.03 for instance.)

@zw76859420
Author

Thanks Dan, my idol.
Zengwei also gave me the same suggestion, and we have now set the base-lr to 0.035, keeping the rest of the above config.
We will show you the training results as soon as possible.

@zw76859420
Author

I meant the normal training log output leading up to that point, not the error traceback. (But yeah, for max-duration of 600 and world-size 3, perhaps that base-lr is too large, could try 0.03 for instance.)

It works!!!
When I reduced the base-lr from 0.045 to 0.030, the loss started dropping quite normally.
[screenshot of the training loss]
