What do the loss curves look like during your successful training? #16

Open
YuXiangLin1234 opened this issue Jul 29, 2024 · 5 comments

@YuXiangLin1234
Hello,

I've attempted to train FAcodec on my own dataset, but whether I train from scratch or fine-tune your provided checkpoint, the reconstructed audio clips are just noise. I fine-tuned the model on roughly 128 hours of Common Voice 18 zh-TW data. After approximately 20k steps the loss seemed to converge: some losses, such as the feature loss, decreased as expected, while others, such as the mel loss and the waveform loss, kept oscillating.

Do all losses decrease during your training process?
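
For reference, a rough sketch of how the oscillation can be checked numerically by dumping the scalars from the TensorBoard event files; the tag names (`mel_loss`, `feature_loss`, `waveform_loss`) are assumptions and may not match what this repo actually logs:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalar(logdir, tag):
    """Return a list of (step, value) pairs for one scalar tag, or [] if absent."""
    acc = EventAccumulator(logdir)
    acc.Reload()
    try:
        events = acc.Scalars(tag)
    except KeyError:
        return []
    return [(e.step, e.value) for e in events]

for tag in ["mel_loss", "feature_loss", "waveform_loss"]:
    points = load_scalar("./logs/run1", tag)
    if not points:
        print(tag, "not found in this log directory")
        continue
    tail = [v for _, v in points[-200:]]
    # a large spread over the last few hundred points indicates the loss is still oscillating
    print(f"{tag}: last={tail[-1]:.4f}  spread over last {len(tail)} points={max(tail) - min(tail):.4f}")
```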

@Plachtaa
Owner

Could you please share your audio samples and loss curves? I believe they would help in analyzing the issue you encountered.

@YuXiangLin1234
Author

YuXiangLin1234 commented Jul 29, 2024

The loss curves look like this:

[screenshot of the loss curves]

The original audio samples are from the Common Voice zh-TW dataset:
https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0/viewer/zh-TW

The reconstructed audio sample:
https://drive.google.com/file/d/1yk_xZL17FkhIYMjojesd-PHWyAKuqzSA/view?usp=sharing

@Plachtaa
Owner

According to the mel_loss in the loss curve you shared, the model seems to have converged well.
However, the reconstructed audio sample sounds as if it were generated by a randomly initialized model.
May I know whether the reconstructed sample was retrieved from TensorBoard or produced by a separate reconstruction script?
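
(For context, an offline round-trip check outside TensorBoard would look roughly like the sketch below; the model construction and checkpoint loading are left as placeholders because the exact entry points are repo-specific.)

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Construct the codec and load the trained weights here, using whatever this
# repo's model class and checkpoint loader are (omitted; repo-specific).
model = ...
model = model.to(device).eval()

wav, sr = torchaudio.load("input.wav")
wav = torchaudio.functional.resample(wav, sr, 24000).to(device)  # repo trains at 24 kHz

with torch.no_grad():
    recon = model(wav)  # forward pass: encode -> quantize -> decode
    # if the forward returns a tuple or dict, pick out the waveform tensor here

recon = recon.squeeze(0).cpu()
if recon.dim() == 1:               # torchaudio.save expects (channels, time)
    recon = recon.unsqueeze(0)
torchaudio.save("recon.wav", recon, 24000)
```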

@ndhuynh02

ndhuynh02 commented Nov 17, 2024

Hi there! Thank you for the training code. The original paper's release provides only the model and checkpoint, not the training code, so this is very helpful. However, I am running into a training problem.

I am trying to train this model from scratch. However, instead of using the provided code as-is, I have changed it to be closer to the original paper. Here are the modifications I have made (summarized in the sketch at the end of this comment):

  • Loss function weights, so they match the values given in the Appendix.
  • Audio sample rate. The provided code uses 24 kHz, so I changed it to 16 kHz as in the paper.
  • Hop length and down-sampling rates. Since the sample rate changed, I set these to 200 and [2, 4, 5, 5] respectively (2 × 4 × 5 × 5 = 200, so one latent frame still corresponds to one hop of audio). In addition, my n_mels is 128 instead of 80.
  • Pitch extractor. The provided code uses a pretrained JDC model to produce F0 labels; I instead used the original approach with which that PitchExtractor model was trained.
  • Phoneme extractor. The provided code uses a Wav2Vec model to produce labels for the phoneme quantizer; I use the Montreal Forced Aligner instead, so the phonemes are in CMU (ARPAbet) format rather than IPA as with Wav2Vec.
  • The maximum number of frames used for training is 80 in this code, which seemed a bit small, so I increased it to 512 frames.

With these modifications the model does not converge: the output is nonsense, the phoneme predictor predicts the same phoneme everywhere, and the codebook loss is huge.
Can anybody help me fix this problem?
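
For reference, a compact sketch of the changed hyperparameters as a plain Python dict (the key names are illustrative, not this repo's actual config keys), with a consistency check between the hop length and the down-sampling factors:

```python
# Modified hyperparameters described above (illustrative key names only).
modified = {
    "sample_rate": 16000,             # paper setting, instead of the repo's 24 kHz
    "hop_length": 200,                # 16000 / 200 = 80 frames per second
    "downsample_rates": [2, 4, 5, 5],
    "n_mels": 128,                    # instead of 80
    "max_frames": 512,                # instead of 80
}

# Sanity check: the total decoder up-sampling factor should equal the hop length,
# so one latent frame maps back to exactly one hop of audio.
total = 1
for r in modified["downsample_rates"]:
    total *= r
assert total == modified["hop_length"], (total, modified["hop_length"])
```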

@Plachtaa
Owner


I understand your reasoning, but I strongly recommend starting from the existing code, which is known to work, and then making your desired changes one step at a time; otherwise it is impossible to isolate the cause.
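
To make that concrete, the idea is roughly the sketch below: re-enable one modification at a time on top of the known-working baseline and re-check convergence after each change. The helper functions and override keys are hypothetical, mirroring the illustrative config sketch earlier in the thread.

```python
# Hypothetical helpers: load_baseline_config returns the provided, working setup;
# quick_sanity_run does a short training run and reports whether losses still behave.
baseline = load_baseline_config("config.yml")

modifications = [
    ("paper_loss_weights", {"loss_weights": "appendix"}),
    ("16khz_frontend",     {"sample_rate": 16000, "hop_length": 200,
                            "downsample_rates": [2, 4, 5, 5], "n_mels": 128}),
    ("mfa_phoneme_labels", {"phoneme_labels": "mfa_cmu"}),
    ("longer_segments",    {"max_frames": 512}),
]

config = dict(baseline)
for name, override in modifications:
    config.update(override)
    ok = quick_sanity_run(config, steps=2000)
    print(name, "still converges" if ok else "BREAKS HERE")
    if not ok:
        break
```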
