The loss becomes nan. #80

Open
a1840436478 opened this issue Dec 26, 2024 · 6 comments

@a1840436478

Hello, when the run reaches the validation_routine, the loss value increases quickly and then becomes NaN. Do you know why?

@CrohnEngineer
Collaborator

Hey @a1840436478 ,

It looks like you are overfitting to your training data.
Are you using the same splits that we have used in our paper? All instructions for replicating our experiments are included in the README.md.
Bests,

Edoardo

@a1840436478
Author

Yes, I'm using the DFDC dataset. First I run "index_dfdc.py" to generate a pkl file, then I run "extract_faces.py" to extract the face images, and finally I run "train_binclass.py" with the parameters specified in the train_all.sh file. Because I use Windows, I run everything from PyCharm. Every time execution reaches "validation_routine", the loss value increases rapidly; here is a record of the loss values I printed: [119.45784385909792 -> 19070101454848.348 -> 5.573101113846284e+24 -> nan]. One more thing: when I remove "net.eval()", the loss value no longer looks abnormal, but that doesn't solve the problem, and it bothers me a lot.

@a1840436478
Author

My dataset was downloaded directly from "https://www.kaggle.com/competitions/deepfake-detection-challenge/data" as the file "all.zip (471.84 GB)", which contains the folders "dfdc_train_part_0" --> "dfdc_train_part_49".
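
Since the numbers only look normal when net.eval() is removed, one common cause of this kind of eval-only divergence is the BatchNorm running statistics, which are used in place of batch statistics in evaluation mode. Below is a minimal diagnostic sketch along those lines, assuming a standard PyTorch model `net` and a validation DataLoader `val_loader` yielding (images, labels); these are placeholder names, not code from the repository:

```python
# Diagnostic sketch (placeholder names `net` and `val_loader`, not repository code):
# net.eval() makes BatchNorm use its stored running statistics, so if those
# statistics contain NaNs or extreme values, the validation loss can explode even
# though the training loss looks fine.
import torch
import torch.nn as nn

def inspect_batchnorm_stats(net: nn.Module) -> None:
    """Print BatchNorm layers whose running statistics look degenerate."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    for name, module in net.named_modules():
        if isinstance(module, bn_types):
            if module.running_mean is None or module.running_var is None:
                continue  # track_running_stats=False: nothing stored to inspect
            mean, var = module.running_mean, module.running_var
            if torch.isnan(mean).any() or torch.isnan(var).any():
                print(f"{name}: NaN in running statistics")
            elif mean.abs().max().item() > 1e4 or var.max().item() > 1e4:
                print(f"{name}: extreme running statistics "
                      f"(|mean| max {mean.abs().max().item():.3g}, "
                      f"var max {var.max().item():.3g})")

def first_nan_batch(net: nn.Module, val_loader, device: str = "cuda"):
    """Return the index of the first validation batch whose outputs contain NaN."""
    net.eval()
    with torch.no_grad():
        for idx, (images, labels) in enumerate(val_loader):
            outputs = net(images.to(device))
            if torch.isnan(outputs).any():
                return idx
    return None
```

If the running statistics themselves are already NaN or huge, the problem originates during training (e.g., exploding activations) rather than in the validation code.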

@CrohnEngineer
Collaborator

CrohnEngineer commented Dec 31, 2024

Hey @a1840436478 ,

This looks pretty strange TBH.
I'll ask you some questions to see if we can catch this bug:

  1. Are you using the environment indicated in the environment.yml file?
  2. Have you modified any part of the train_binclass.py script?

One more thing: when I remove "net.eval()", the loss value no longer looks abnormal, but that doesn't solve the problem, and it bothers me a lot.

You should not remove the net.eval() instruction: it switches layers such as BatchNorm and Dropout to evaluation mode, which is required for a correct validation pass (gradient tracking is disabled separately, e.g. with torch.no_grad());
  3. Does the validation loss become high immediately, or only after a certain number of validation routines? Can you provide a plot of the training/validation loss curves?
  4. Which model are you trying to train?

We never tested our code on a Windows server, but it should not be a problem.
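
For reference, a validation pass that keeps net.eval() while disabling gradient tracking with torch.no_grad() could look like the sketch below; these are generic placeholder names, not the repository's own validation_routine:

```python
# Generic validation-pass sketch (placeholder names, not train_binclass.py's own
# validation_routine): net.eval() puts BatchNorm/Dropout into evaluation mode,
# while torch.no_grad() disables gradient tracking.
import torch

def validate(net, val_loader, criterion, device: str = "cuda") -> float:
    net.eval()                       # evaluation mode for BatchNorm / Dropout
    total_loss, total_samples = 0.0, 0
    with torch.no_grad():            # no gradient tracking during validation
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * images.size(0)
            total_samples += images.size(0)
    net.train()                      # restore training mode afterwards
    return total_loss / total_samples
```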

@a1840436478
Author

Okay, but I have one last question: when I extract the faces, loading the model on the GPU gives me images different from those obtained with the CPU. The images obtained with the CPU are correct, while the images obtained with the GPU are completely wrong. Do you know why?

@CrohnEngineer
Collaborator

Do you mean that the extract_faces.py script outputs different images if the model is loaded on GPU rather than CPU?
Running BlazeFace on different devices should not make a difference, as long as the model you are loading is correct (i.e., you are loading blazeface.pth and anchors.npy correctly, as indicated in extract_faces.py).
As far as I remember, we always used the GPU to extract faces from the video frames.
However, if the faces you have extracted make little sense, that might explain why your model is overfitting so badly (:
Can you post some examples here?
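
One way to check this is to run the same frame through the detector on both devices and compare the detections. The sketch below is hypothetical and assumes the BlazeFace class used by extract_faces.py exposes load_weights, load_anchors, and predict_on_image, as in the BlazeFace-PyTorch port that blazeface.pth and anchors.npy come from; adjust the import and file paths to your setup:

```python
# Hypothetical sanity check (not part of extract_faces.py): compare face detections
# for the same frame on CPU and GPU. Assumes the BlazeFace class provides
# load_weights / load_anchors / predict_on_image; adjust paths and import as needed.
import numpy as np
import torch
from blazeface import BlazeFace  # assumed module/class name

def detect(device: str, frame_128: np.ndarray) -> torch.Tensor:
    net = BlazeFace().to(device)
    net.load_weights("blazeface.pth")
    net.load_anchors("anchors.npy")
    dets = net.predict_on_image(frame_128)   # frame_128: 128x128 RGB image
    return dets.cpu() if isinstance(dets, torch.Tensor) else torch.as_tensor(dets)

frame = np.zeros((128, 128, 3), dtype=np.uint8)  # replace with a real extracted frame
cpu_dets = detect("cpu", frame)
if torch.cuda.is_available():
    gpu_dets = detect("cuda", frame)
    if cpu_dets.shape != gpu_dets.shape:
        print("different number of detections:", tuple(cpu_dets.shape), tuple(gpu_dets.shape))
    elif cpu_dets.numel() == 0:
        print("no detections on either device (use a frame that contains a face)")
    else:
        print("max |CPU - GPU| difference:", (cpu_dets - gpu_dets).abs().max().item())
```

A large difference (or a different number of detections) would point to the detector setup rather than the training code.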
