
Finetune XTTS for new languages #3992

Open
anhnh2002 opened this issue Sep 8, 2024 · 20 comments
Labels
feature request feature requests for making TTS better.

Comments

@anhnh2002

anhnh2002 commented Sep 8, 2024

Hello everyone, below is my code for fine-tuning XTTS for a new language. It works well in my case with over 100 hours of audio.

https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages
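Since the amount of training audio keeps coming up in this thread (the author used over 100 hours), here is a small stdlib-only helper to check how many hours of WAV audio a dataset folder actually contains before committing to a run. The function name and path are my own, not part of the linked repo:

```python
import wave
from pathlib import Path

def total_hours(wav_dir):
    """Sum the durations of all .wav files under wav_dir, in hours."""
    seconds = 0.0
    for path in Path(wav_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as w:
            seconds += w.getnframes() / w.getframerate()
    return seconds / 3600.0

# total_hours("datasets/my_language/wavs")  # e.g. aim for ~100 h as above
```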

@anhnh2002 anhnh2002 added the feature request feature requests for making TTS better. label Sep 8, 2024
@jamestech-cmyk

> Hello everyone, below is my code for fine-tuning XTTS for a new language. It works well in my case with over 100 hours of audio.
> https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages

Hello! I'm very pleased with your contribution. Can you provide your trained models? I want to check whether they work well.

@anhnh2002
Author

> Can you provide your trained models? I want to check whether they work well.

Due to copyright issues, I am currently unable to share the model's weights with you. I apologize for the inconvenience.

@jamestech-cmyk

How long did it take you to train 100 hours of audio, and can you tell me your current computer configuration?

@anhnh2002
Author

anhnh2002 commented Sep 8, 2024

> How long did it take you to train 100 hours of audio, and can you tell me your current computer configuration?

It took a little over 8 hours to train on 100 hours of audio on a single A100 40 GB.

@mohataher

> Due to copyright issues, I am currently unable to share the model's weights with you.

Understandable. However, would you be able to share a snippet of audio that the model has produced?

@anhnh2002
Author

> Would you be able to share a snippet of audio that the model has produced?

Please find the relevant file at the following Google Drive link:

View File

@rose07

rose07 commented Oct 15, 2024

@developeranalyser

Hi, what loss did you reach, and how many steps did you train for?

Is it possible to train the XTTS-v2 model on about 10 hours of audio, and can it work well based on only those 10 hours?

Actually, I trained the model with your code and reached a loss of 0.5, but the output was very bad and nothing was intelligible. I used the google/fleurs dataset for the Farsi language. First I expanded the vocab, then trained the DVAE, and then trained the model for 10,000 steps. Why do you think I am getting such bad results?

Thank you very much

@anhnh2002
Author

> Is it possible to train the XTTS-v2 model on about 10 hours of audio, and can it work well based on only those 10 hours?

First, I recommend that you do not train the DVAE, because you have a small amount of data. And I think 10 hours is not enough; the model will overfit your data. The losses I got were about 0.8.

@developeranalyser

Thanks for your good work and your reply. I did that, and the losses were:

| > avg_loader_time: 0.18475866317749023 (+0.00680994987487793)
| > avg_loss_text_ce: 0.036836352199316025 (-0.0016442164778709412)
| > avg_loss_mel_ce: 0.03139156103134155 (-0.001425366848707199)
| > avg_loss: 0.06822791695594788 (-0.003069579601287842)

But after inference, even on one of the sentences the model was trained on, I get bad audio that is not in the trained language; the sound that is produced is not close to the trained language at all.

result.zip

@developeranalyser

How many epochs and steps are required for training on 100 hours of data? And how many hours did it take you, my friend?

@kunibald413

Hi, nice work!
You might want to open a merge request for it against the still-maintained fork of coqui-ai TTS: https://github.com/idiap/coqui-ai-TTS

I'm not involved with it, just an idea.

@anhnh2002
Author

> How many epochs and steps are required for training on 100 hours of data?

Two epochs worked well for me.
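For planning a run like this, the step count follows directly from dataset size, batch size, and epoch count. A tiny helper with illustrative names and an assumed batch size (the thread does not state one):

```python
import math

def optimizer_steps(num_samples, batch_size, epochs, grad_accum=1):
    """Optimizer steps needed to see the dataset `epochs` times."""
    steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs

# e.g. ~36,000 clips (~100 h of ~10 s utterances) at batch size 8 for 2 epochs:
print(optimizer_steps(36_000, batch_size=8, epochs=2))  # 9000
```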

@developeranalyser

> Two epochs worked well for me.

For a new language, do we need to train the vocoder after this training?

Also, if the loss decreases below 1 but the model still reads the text incorrectly, what is your opinion? What do you advise me to do to solve this? Perhaps that would fix my main problem.
Thank you

@developeranalyser

I don't want to train the model on a whole language. I want to train it on a limited set of sentences in a new language, for example 1,000 sentences. What is your opinion? Is that possible?

@anhnh2002
Author

> I want to train it on a limited set of sentences in a new language, for example 1,000 sentences. Is that possible?

I think it's impossible to overfit the model with only 1000 sentences, especially for a new language. You'd need to extend the tokenizer and likely train a base model on a larger dataset of that language first.
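To make the "extend the tokenizer" step concrete: before any training, the new language's token and any characters missing from the vocabulary must be appended without disturbing existing ids, so the pretrained embedding rows still line up. A minimal self-contained sketch; this is illustrative, not the actual XTTS/coqui tokenizer API:

```python
def extend_vocab(vocab, lang_token, new_symbols):
    """Return a copy of `vocab` with a language token (e.g. "[fa]") and any
    unseen symbols appended; existing token ids stay unchanged."""
    extended = dict(vocab)
    for token in [lang_token, *new_symbols]:
        if token not in extended:
            extended[token] = len(extended)
    return extended

base = {"[en]": 0, "a": 1, "b": 2}
extended = extend_vocab(base, "[fa]", ["a", "پ", "چ"])
# "a" keeps id 1; "[fa]", "پ", "چ" get ids 3, 4, 5
```

In the real model the text-embedding matrix must also be resized to match the new vocabulary size, which is what the fine-tuning code has to handle after this step.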

@developeranalyser

> I think it's impossible to overfit the model with only 1000 sentences, especially for a new language. You'd need to extend the tokenizer and likely train a base model on a larger dataset of that language first.

Thank you very much. So, in your opinion, my problem is the small amount of data: I cannot get good results from this model trained on a few sentences, and it must be trained on a large amount of data. I did expand the vocab and train the DVAE.
Honestly, I wanted to first test how the model behaves when trained on little data, and then run it on a lot of data.
Another question: what learning rate should I use, so that the model's ability in other languages is not lost, while it still learns the new language well and quickly on a lot of data?

Thank you for generously sharing your knowledge :)

@developeranalyser

In short, is it not possible to teach the model a language with 10 letters and about 100 sentences, so that it reads those 100 trained sentences correctly?

@NathanTrance

Hey, great work!

I have a question: I want to train this model on Vietnamese, but with vi-north and vi-south as separate languages, each with its own metadata CSV. Does the multi-dataset training option support this, shuffling the vi-north and vi-south data together while keeping the language codes separate?

Thank you in advance!

@anhnh2002
Author

Yes, you can
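A self-contained sketch of what that looks like conceptually: each metadata CSV is parsed with its own language code attached to every sample, and the combined list is shuffled before batching. The names here are illustrative, not the repo's actual loader API:

```python
import csv
import io
import random

def load_samples(meta_csv_text, language):
    """Parse LJSpeech-style 'audio|text' rows, tagging each with a language code."""
    rows = csv.reader(io.StringIO(meta_csv_text), delimiter="|")
    return [{"audio_file": r[0], "text": r[1], "language": language} for r in rows]

north = load_samples("clips/n1.wav|xin chào\nclips/n2.wav|cảm ơn\n", "vi-north")
south = load_samples("clips/s1.wav|xin chào\n", "vi-south")

mixed = north + south
random.Random(0).shuffle(mixed)  # both dialects interleaved in one training list
```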


7 participants