What do the loss curves look like during your successful training? #16

Open
YuXiangLin1234 opened this issue Jul 29, 2024 · 5 comments

@YuXiangLin1234
Hello,

I've attempted to train FAcodec on my own dataset, but whether I train from scratch or fine-tune your provided checkpoint, the reconstructed audio clips are just noise. I fine-tuned the model on roughly 128 hours of Common Voice 18 zh-TW data. After approximately 20k steps the loss seemed to converge: some losses, such as the feature loss, decreased as expected, while others, such as the mel loss and the waveform loss, kept oscillating.

Do all losses decrease during your training process?
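
For reference, a rough sketch of how the oscillation can be checked numerically by dumping the scalars from the TensorBoard event files; the tag names (`mel_loss`, `feature_loss`, `waveform_loss`) are assumptions and may not match what this repo actually logs:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalar(logdir, tag):
    """Return a list of (step, value) pairs for one scalar tag, or [] if absent."""
    acc = EventAccumulator(logdir)
    acc.Reload()
    try:
        events = acc.Scalars(tag)
    except KeyError:
        return []
    return [(e.step, e.value) for e in events]

for tag in ["mel_loss", "feature_loss", "waveform_loss"]:
    points = load_scalar("./logs/run1", tag)
    if not points:
        print(tag, "not found in this log directory")
        continue
    tail = [v for _, v in points[-200:]]
    # a large spread over the last few hundred points indicates the loss is still oscillating
    print(f"{tag}: last={tail[-1]:.4f}  spread over last {len(tail)} points={max(tail) - min(tail):.4f}")
```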

@Plachtaa
Owner

Could you please share your audio samples and loss curves? I believe they would help in analyzing the issue you encountered.

@YuXiangLin1234
Author

YuXiangLin1234 commented Jul 29, 2024

The loss curves look like this:

[screenshot of the loss curves]

The original audio samples are from the Common Voice zh-TW dataset:
https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0/viewer/zh-TW

The reconstructed audio sample:
https://drive.google.com/file/d/1yk_xZL17FkhIYMjojesd-PHWyAKuqzSA/view?usp=sharing

@Plachtaa
Owner

According to the mel_loss in the loss curve you shared, the model seems to have converged well.
However, the reconstructed audio sample sounds as if it were generated by a randomly initialized model.
May I know whether the reconstructed sample was retrieved from TensorBoard or produced by a separate reconstruction script?
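
(For context, an offline round-trip check outside TensorBoard would look roughly like the sketch below; the model construction and checkpoint loading are left as placeholders because the exact entry points are repo-specific.)

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Construct the codec and load the trained weights here, using whatever this
# repo's model class and checkpoint loader are (omitted; repo-specific).
model = ...
model = model.to(device).eval()

wav, sr = torchaudio.load("input.wav")
wav = torchaudio.functional.resample(wav, sr, 24000).to(device)  # repo trains at 24 kHz

with torch.no_grad():
    recon = model(wav)  # forward pass: encode -> quantize -> decode
    # if the forward returns a tuple or dict, pick out the waveform tensor here

recon = recon.squeeze(0).cpu()
if recon.dim() == 1:               # torchaudio.save expects (channels, time)
    recon = recon.unsqueeze(0)
torchaudio.save("recon.wav", recon, 24000)
```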

@ndhuynh02

ndhuynh02 commented Nov 17, 2024

Hi there! Thank you for the training code. The original paper's release provides only the model and checkpoint, not the training code, so this is very helpful. However, I am running into a training problem.

I am trying to train this model from scratch. However, instead of using the provided code as-is, I have changed it to be closer to the original paper. Here are the modifications I have made (summarized in the sketch at the end of this comment):

  • Loss function weights, so they match the values given in the Appendix.
  • Audio sample rate. The provided code uses 24 kHz, so I changed it to 16 kHz as in the paper.
  • Hop length and down-sampling rates. Since the sample rate changed, I set these to 200 and [2, 4, 5, 5] respectively (2 × 4 × 5 × 5 = 200, so one latent frame still corresponds to one hop of audio). In addition, my n_mels is 128 instead of 80.
  • Pitch extractor. The provided code uses a pretrained JDC model to produce F0 labels; I instead used the original approach with which that PitchExtractor model was trained.
  • Phoneme extractor. The provided code uses a Wav2Vec model to produce labels for the phoneme quantizer; I use the Montreal Forced Aligner instead, so the phonemes are in CMU (ARPAbet) format rather than IPA as with Wav2Vec.
  • The maximum number of frames used for training is 80 in this code, which seemed a bit small, so I increased it to 512 frames.

With these modifications the model does not converge: the output is nonsense, the phoneme predictor predicts the same phoneme everywhere, and the codebook loss is huge.
Can anybody help me fix this problem?
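
For reference, a compact sketch of the changed hyperparameters as a plain Python dict (the key names are illustrative, not this repo's actual config keys), with a consistency check between the hop length and the down-sampling factors:

```python
# Modified hyperparameters described above (illustrative key names only).
modified = {
    "sample_rate": 16000,             # paper setting, instead of the repo's 24 kHz
    "hop_length": 200,                # 16000 / 200 = 80 frames per second
    "downsample_rates": [2, 4, 5, 5],
    "n_mels": 128,                    # instead of 80
    "max_frames": 512,                # instead of 80
}

# Sanity check: the total decoder up-sampling factor should equal the hop length,
# so one latent frame maps back to exactly one hop of audio.
total = 1
for r in modified["downsample_rates"]:
    total *= r
assert total == modified["hop_length"], (total, modified["hop_length"])
```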

@Plachtaa
Owner


I understand your reasoning, but I strongly recommend starting from the existing code, which is known to work, and then making your desired changes one step at a time; otherwise it is impossible to isolate the cause.
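
To make that concrete, the idea is roughly the sketch below: re-enable one modification at a time on top of the known-working baseline and re-check convergence after each change. The helper functions and override keys are hypothetical, mirroring the illustrative config sketch earlier in the thread.

```python
# Hypothetical helpers: load_baseline_config returns the provided, working setup;
# quick_sanity_run does a short training run and reports whether losses still behave.
baseline = load_baseline_config("config.yml")

modifications = [
    ("paper_loss_weights", {"loss_weights": "appendix"}),
    ("16khz_frontend",     {"sample_rate": 16000, "hop_length": 200,
                            "downsample_rates": [2, 4, 5, 5], "n_mels": 128}),
    ("mfa_phoneme_labels", {"phoneme_labels": "mfa_cmu"}),
    ("longer_segments",    {"max_frames": 512}),
]

config = dict(baseline)
for name, override in modifications:
    config.update(override)
    ok = quick_sanity_run(config, steps=2000)
    print(name, "still converges" if ok else "BREAKS HERE")
    if not ok:
        break
```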
