The loss becomes nan. #80

Open
a1840436478 opened this issue Dec 26, 2024 · 6 comments

@a1840436478

Hello, when the run reaches the validation_routine, the loss value increases quickly and then becomes NaN. Do you know why?

@CrohnEngineer
Collaborator

Hey @a1840436478 ,

It looks like you are overfitting to your training data.
Are you using the same splits that we have used in our paper? All instructions for replicating our experiments are included in the README.md.
Bests,

Edoardo

@a1840436478
Author

Yes, I'm using the DFDC dataset. First I run "index_dfdc.py" to generate a pkl file, then I run "extract_faces.py" to extract the face images, and finally I run "train_binclass.py" with the parameters specified in the train_all.sh file. Because I use Windows, I run everything from PyCharm. Every time execution reaches "validation_routine", the loss value increases rapidly; here is a record of the loss values I printed: [119.45784385909792 -> 19070101454848.348 -> 5.573101113846284e+24 -> nan]. One more thing: when I remove "net.eval()", the loss value no longer looks abnormal, but that doesn't solve the problem, and it bothers me a lot.

@a1840436478
Author

My dataset was downloaded directly from "https://www.kaggle.com/competitions/deepfake-detection-challenge/data" as the file "all.zip (471.84 GB)", which contains the folders "dfdc_train_part_0" --> "dfdc_train_part_49".
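
Since the numbers only look normal when net.eval() is removed, one common cause of this kind of eval-only divergence is the BatchNorm running statistics, which are used in place of batch statistics in evaluation mode. Below is a minimal diagnostic sketch along those lines, assuming a standard PyTorch model `net` and a validation DataLoader `val_loader` yielding (images, labels); these are placeholder names, not code from the repository:

```python
# Diagnostic sketch (placeholder names `net` and `val_loader`, not repository code):
# net.eval() makes BatchNorm use its stored running statistics, so if those
# statistics contain NaNs or extreme values, the validation loss can explode even
# though the training loss looks fine.
import torch
import torch.nn as nn

def inspect_batchnorm_stats(net: nn.Module) -> None:
    """Print BatchNorm layers whose running statistics look degenerate."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    for name, module in net.named_modules():
        if isinstance(module, bn_types):
            if module.running_mean is None or module.running_var is None:
                continue  # track_running_stats=False: nothing stored to inspect
            mean, var = module.running_mean, module.running_var
            if torch.isnan(mean).any() or torch.isnan(var).any():
                print(f"{name}: NaN in running statistics")
            elif mean.abs().max().item() > 1e4 or var.max().item() > 1e4:
                print(f"{name}: extreme running statistics "
                      f"(|mean| max {mean.abs().max().item():.3g}, "
                      f"var max {var.max().item():.3g})")

def first_nan_batch(net: nn.Module, val_loader, device: str = "cuda"):
    """Return the index of the first validation batch whose outputs contain NaN."""
    net.eval()
    with torch.no_grad():
        for idx, (images, labels) in enumerate(val_loader):
            outputs = net(images.to(device))
            if torch.isnan(outputs).any():
                return idx
    return None
```

If the running statistics themselves are already NaN or huge, the problem originates during training (e.g., exploding activations) rather than in the validation code.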

@CrohnEngineer
Collaborator

CrohnEngineer commented Dec 31, 2024

Hey @a1840436478 ,

This looks pretty strange TBH.
I'll ask you some questions to see if we can catch this bug:

  1. Are you using the environment indicated in the environment.yml file?
  2. Have you modified any part of the train_binclass.py script?

One more thing: when I remove "net.eval()", the loss value no longer looks abnormal, but that doesn't solve the problem, and it bothers me a lot.

You should not remove the net.eval() instruction: it switches layers such as BatchNorm and Dropout to evaluation mode, which is required for a correct validation pass (gradient tracking is disabled separately, e.g. with torch.no_grad());
  3. Does the validation loss become high immediately, or only after a certain number of validation routines? Can you provide a plot of the training/validation loss curves?
  4. Which model are you trying to train?

We never tested our code on a Windows server, but it should not be a problem.
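
For reference, a validation pass that keeps net.eval() while disabling gradient tracking with torch.no_grad() could look like the sketch below; these are generic placeholder names, not the repository's own validation_routine:

```python
# Generic validation-pass sketch (placeholder names, not train_binclass.py's own
# validation_routine): net.eval() puts BatchNorm/Dropout into evaluation mode,
# while torch.no_grad() disables gradient tracking.
import torch

def validate(net, val_loader, criterion, device: str = "cuda") -> float:
    net.eval()                       # evaluation mode for BatchNorm / Dropout
    total_loss, total_samples = 0.0, 0
    with torch.no_grad():            # no gradient tracking during validation
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * images.size(0)
            total_samples += images.size(0)
    net.train()                      # restore training mode afterwards
    return total_loss / total_samples
```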

@a1840436478
Author

Okay, but I have one last question: when I extract the faces, loading the model on the GPU gives me images different from those obtained with the CPU. The images obtained with the CPU are correct, while the images obtained with the GPU are completely wrong. Do you know why?

@CrohnEngineer
Collaborator

Do you mean that the extract_faces.py script outputs different images if the model is loaded on GPU rather than CPU?
Running BlazeFace on different devices should not make a difference, as long as the model you are loading is correct (i.e., you are loading blazeface.pth and anchors.npy correctly, as indicated in extract_faces.py).
As far as I remember, we always used the GPU to extract faces from the video frames.
However, if the faces you have extracted make little sense, that might explain why your model is overfitting so badly (:
Can you post some examples here?
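
One way to check this is to run the same frame through the detector on both devices and compare the detections. The sketch below is hypothetical and assumes the BlazeFace class used by extract_faces.py exposes load_weights, load_anchors, and predict_on_image, as in the BlazeFace-PyTorch port that blazeface.pth and anchors.npy come from; adjust the import and file paths to your setup:

```python
# Hypothetical sanity check (not part of extract_faces.py): compare face detections
# for the same frame on CPU and GPU. Assumes the BlazeFace class provides
# load_weights / load_anchors / predict_on_image; adjust paths and import as needed.
import numpy as np
import torch
from blazeface import BlazeFace  # assumed module/class name

def detect(device: str, frame_128: np.ndarray) -> torch.Tensor:
    net = BlazeFace().to(device)
    net.load_weights("blazeface.pth")
    net.load_anchors("anchors.npy")
    dets = net.predict_on_image(frame_128)   # frame_128: 128x128 RGB image
    return dets.cpu() if isinstance(dets, torch.Tensor) else torch.as_tensor(dets)

frame = np.zeros((128, 128, 3), dtype=np.uint8)  # replace with a real extracted frame
cpu_dets = detect("cpu", frame)
if torch.cuda.is_available():
    gpu_dets = detect("cuda", frame)
    if cpu_dets.shape != gpu_dets.shape:
        print("different number of detections:", tuple(cpu_dets.shape), tuple(gpu_dets.shape))
    elif cpu_dets.numel() == 0:
        print("no detections on either device (use a frame that contains a face)")
    else:
        print("max |CPU - GPU| difference:", (cpu_dets - gpu_dets).abs().max().item())
```

A large difference (or a different number of detections) would point to the detector setup rather than the training code.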
