-
I've noticed the blank ratio of the simple AM goes down (and drops to zero after the loss starts spiking). Has anyone else had this happen?
-
I'm seeing training diverge surprisingly often while using hyperparameters that would work for a CTC+attn model.
I'm not using the icefall setup (essentially just its compute_loss function), but I was hoping someone might have an idea of what could be responsible.
I'm using what I think is a relatively long warm-up period (10k steps), an even longer learning rate warmup, and I'm rescaling the losses continuously as in the latest zipformer recipe. I'm using an AdamW beta2 of 0.95 and my LayerNorms are in fp32. I see loss spikes around 11k or 12k steps. I'm using the modified model, and my k2 version is very recent.
I'm curious whether others have experienced this. I'm sure it would work if I kept lowering the LR, but I'm already using a lower-than-normal LR.
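For context, the continuous loss rescaling I mean is roughly the following (a sketch based on my reading of the zipformer recipe, not its exact code; `warm_step` and `simple_loss_scale` are placeholders for my own settings, and `simple_loss` / `pruned_loss` are the two terms coming out of the pruned RNN-T loss):

```python
# Rough sketch of the warm-up loss rescaling (modeled on the zipformer recipe);
# values and names here are placeholders for my setup, not the recipe's exact ones.
def scale_losses(simple_loss, pruned_loss, step, warm_step=10000,
                 simple_loss_scale=0.5):
    if step >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        frac = step / warm_step
        # simple-loss weight ramps down from 1.0 toward simple_loss_scale,
        # pruned-loss weight ramps up from 0.1 to 1.0 over the warm-up
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss
```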