RuntimeError: Model diverged with loss = NaN. #890

FPBHW · 2021-10-06T14:11:18Z

Hi,

this might be more of a question than an issue, but I'm observing loss = NaN, and am not clear how this could be, at least not from a theoretical standpoint.

It is possible for loss to be "INFinite" when a prediction is a hard 0 or 1, and the target is the opposite. However, a NaN should - as far as I understand - not occur.

In summary:

-log(0) = inf (not observed, inconvenient, but understandable)
-log(x) = NaN -> Observed, but I do not know for which x this could happen (assuming x >= 0)

Does anyone have an explanation ?

Using classic SGD OpenNMT-tf-2.20.1, NaN loss encountered after 49900 steps.

Many thanks in advance,
Fokko

The text was updated successfully, but these errors were encountered:

FPBHW · 2021-10-06T14:14:28Z

Hypothesis: inside the model, positive and negative INFs are generated and then summed, leading to a prediction of NaN. If this is the case, it would be better to report the prediction as NaN.

guillaumekln · 2021-10-06T14:59:49Z

Hypothesis: inside the model, positive and negative INFs are generated and then summed, leading to a prediction of NaN.

Yes, for example:

>>> import tensorflow as tf
>>> x = tf.constant([float("inf")], dtype=tf.float32)
>>> tf.nn.softmax(x)
<tf.Tensor: shape=(1,), dtype=float32, numpy=array([nan], dtype=float32)>

If this is the case, it would be better to report the prediction as NaN.

What do you mean exactly? What should be improved in the current way of reporting a training that diverges?

FPBHW · 2021-10-07T13:19:27Z

Thank you - indeed, a input of inf would indeed also cause a prediciton of NaN (and thus a loss of NaN). What I meant to say is that it would be nice to have feedback about a NaN (or even inf) as early as possible, so it is more clear what the cause is.

If this the gradient is +-Inf for a particular sample, it would be very helpful if that sample is flagged. This is if the divergence can be explained by bad data.

This issue popped up in 2 different trainings with different data and SGD methods, and both in a training with a gradually decreasing loss of moderate scale, that then suddenly diverged. So the NaN is a surprise.

A potential alternative hypothesis is that there is a bug somewhere (although data issues are more likely).

Do you have a suggestion how we would get insight into what suddenly caused the model to diverge while it seemed to be converging normally ? Particularly how to flag the data under consideration at that moment ?

FPBHW · 2021-10-07T13:20:41Z

Example:

2021-10-07 17:46:13.224000: I runner.py:281] Step = 1238300 ; steps/s = 3.98, source words/s = 280434, target words/s = 307762 ; Learning rate = 0.000079 ; Loss = 1.851973
2021-10-07 17:46:39.361000: I runner.py:281] Step = 1238400 ; steps/s = 3.83, source words/s = 297033, target words/s = 326020 ; Learning rate = 0.000079 ; Loss = 1.801115
Traceback (most recent call last):
  File "/root/miniconda3/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/bin/main.py", line 326, in main
    hvd=hvd,
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/runner.py", line 281, in train
    moving_average_decay=train_config.get("moving_average_decay"),
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 123, in __call__
    dataset, accum_steps=accum_steps, report_steps=report_steps
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 268, in _steps
    raise RuntimeError("Model diverged with loss = NaN.")
RuntimeError: Model diverged with loss = NaN.
[Modelarts Service Log]Training completed.

FPBHW · 2021-10-07T13:24:08Z

Another example, this is with classic SGD:

2021-10-05 22:55:36.368882: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 10 of 500000
2021-10-05 22:55:38.246023: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:230] Shuffle buffer filled.
2021-10-05 22:55:52.652000: I runner.py:281] Step = 49800 ; steps/s = 2.93, source words/s = 52750, target words/s = 42947 ; Learning rate = 0.010000 ; Loss = 4.686182
2021-10-05 22:56:27.645000: I runner.py:281] Step = 49900 ; steps/s = 2.86, source words/s = 52161, target words/s = 42367 ; Learning rate = 0.010000 ; Loss = 4.904016
Traceback (most recent call last):
  File "/root/miniconda3/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/bin/main.py", line 326, in main
    hvd=hvd,
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/runner.py", line 281, in train
    moving_average_decay=train_config.get("moving_average_decay"),
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 123, in __call__
    dataset, accum_steps=accum_steps, report_steps=report_steps
  File "/root/miniconda3/lib/python3.7/site-packages/opennmt/training.py", line 268, in _steps
    raise RuntimeError("Model diverged with loss = NaN.")
RuntimeError: Model diverged with loss = NaN.
[Modelarts Service Log]Training completed.

FPBHW · 2021-10-07T13:26:39Z

Using SGD we see the impact of outliers quite clearly, but the sudden divergence is still unexpected. If there is a any way to establish if this a data issue, that would be very welcome.

Any suggestions ?

guillaumekln · 2021-10-07T13:39:49Z

Misaligned data can indeed cause spikes in the training loss and then exploding gradients if there is no counter measure. Since it is usually difficult to ensure that all examples in the dataset are cleaned enough, the training parameters should be tuned to prevent divergence from happening.

For example, with SGD it is usually required to clip the gradients to prevent them from exploding. Are you doing that? (See the clip* optimizer parameters.)

FPBHW · 2021-10-07T14:26:40Z

I disabled clipping specifically to be able to determine the root cause of the NaN ...

Is there a way to get the sample or batch, and the associated loss per sample at the moment of the crash ?

guillaumekln · 2021-10-07T14:51:15Z

The training samples seen at the moment of the crash may not be the outliers you are looking for, because the divergence may have been started a few iterations before.

A better way to check your training data for outliers is to train a standard model (e.g. a Transformer with --auto_config) and then run the score command on your training data. The outliers should have a high score relative to the other examples.

FPBHW · 2021-10-11T12:07:12Z

I'm currently trying to run score on the data to figure out which examples are causing the issue.

This resulted in the following crash:

2021-10-11 19:29:13.447309: F tensorflow/core/kernels/softmax_op_gpu.cu.cc:275] Non-OK-status: GpuLaunchKernel( GenerateNormalizedProb<T, acc_type, kUnroll>, numBlocks, numThreadsPerBlock, 0, cu_stream, reinterpret_cast<const T*>(logits_in_.flat<T>().data()), reinterpret_cast<const acc_type*>( sum_probs.flat<acc_type>().data()), reinterpret_cast<const T*>(max_logits.flat<T>().data()), const_cast<T*>(softmax_out->flat<T>().data()), rows, cols, log_) status: Internal: invalid configuration argument
opennmt-training/src/score-filter.sh: line 154:   663 Aborted                 onmt-main --config ${SCORE_CONFIG} --auto_config --checkpoint_path ${LOCAL_MODEL_DIR} ${MODEL_PARAMETER} score --features_file ${LOCAL_TOKEN_DIR}/${TRAIN_SRC_TOK} --predictions_file ${LOCAL_TOKEN_DIR}/${TRAIN_TGT_TOK} > ${PREDICTION_FILE}
[Modelarts Service Log]Training completed.

The configuration is:

data:
  source_vocabulary: /cache/MT/vocab/shrd5k-v0/en-ps.ps.vocab
  target_vocabulary: /cache/MT/vocab/shrd5k-v0/en-ps.en.vocab
eval:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
infer:
  batch_size: 150
  batch_type: examples
  length_bucket_width: 5
model_dir: /cache/MT/model
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
  replace_unknown_target: true
  sampling_temperature: 1
  sampling_topk: 1
score:
  batch_size: 64
  batch_type: examples
  length_bucket_width: 5
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100

Is this expected given the configuration ?

guillaumekln · 2021-10-12T08:57:14Z

Maybe an empty line in the data file? The score task does not apply any filtering.

FPBHW · 2021-10-12T11:07:08Z

For at least 1 job with this error I am certain that there are no empty lines. If I have updates on this crash, shall I post them in #523 ?

guillaumekln · 2021-10-12T12:25:43Z

Yes, please post updates on the other issue.

I'm going to close this one as there is not much to add about the NaN loss. Feel free to create other issues if you encounter other problems.

guillaumekln added the question label Oct 6, 2021

guillaumekln closed this as completed Oct 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RuntimeError: Model diverged with loss = NaN. #890

RuntimeError: Model diverged with loss = NaN. #890

FPBHW commented Oct 6, 2021

FPBHW commented Oct 6, 2021

guillaumekln commented Oct 6, 2021

FPBHW commented Oct 7, 2021 •

edited

Loading

FPBHW commented Oct 7, 2021

FPBHW commented Oct 7, 2021

FPBHW commented Oct 7, 2021

guillaumekln commented Oct 7, 2021

FPBHW commented Oct 7, 2021

guillaumekln commented Oct 7, 2021 •

edited

Loading

FPBHW commented Oct 11, 2021

guillaumekln commented Oct 12, 2021

FPBHW commented Oct 12, 2021

guillaumekln commented Oct 12, 2021

RuntimeError: Model diverged with loss = NaN. #890

RuntimeError: Model diverged with loss = NaN. #890

Comments

FPBHW commented Oct 6, 2021

FPBHW commented Oct 6, 2021

guillaumekln commented Oct 6, 2021

FPBHW commented Oct 7, 2021 • edited Loading

FPBHW commented Oct 7, 2021

FPBHW commented Oct 7, 2021

FPBHW commented Oct 7, 2021

guillaumekln commented Oct 7, 2021

FPBHW commented Oct 7, 2021

guillaumekln commented Oct 7, 2021 • edited Loading

FPBHW commented Oct 11, 2021

guillaumekln commented Oct 12, 2021

FPBHW commented Oct 12, 2021

guillaumekln commented Oct 12, 2021

FPBHW commented Oct 7, 2021 •

edited

Loading

guillaumekln commented Oct 7, 2021 •

edited

Loading