Voice Cloning: When to Fine-Tune Pretrained TTS Models and How Much Data is Needed? #102

ClaudiuFilip110 · 2024-10-10T19:05:22Z

Hi, I'm an ML engineer with a few years of experience, but I'm new to TTS and audio ML. My background is primarily in NLP and LLMs. I'm now working on a voice cloning project, and the transition into TTS has been a bit confusing.

Here's my plan. If you have any tips or suggestions, please feel free to add them here.

English Voice Cloning

1. I'm starting with pre-trained voice cloning models (XTTS, VITS, etc.), but I want to improve their performance. Should I use RVC for that, or go straight to fine-tuning?
2. If I need to fine-tune the model to better match a particular voice, do I need a separate dataset for each user, or can a single combined dataset give better results?
3. How much training data should I have per user for fine-tuning?

Voice Cloning in Another Language

I'm planning to train my own model from scratch (or from a checkpoint):

1. Create a dataset of similar size to LJSpeech (about 24 hours of audio).
2. Include multiple speakers instead of one.
3. Train the model starting from VITS or XTTS checkpoints.
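
To check how close a collected dataset is to the LJSpeech-sized target in the plan above, a rough tally of audio hours per speaker can help. This is only a minimal sketch using the Python standard library. It assumes a hypothetical layout of `<root>/<speaker>/*.wav` with uncompressed PCM WAV files; the function name `hours_per_speaker` is made up for illustration and is not part of any TTS toolkit.

```python
import wave
from collections import defaultdict
from pathlib import Path


def hours_per_speaker(root):
    """Sum WAV durations found under root/<speaker>/*.wav, in hours per speaker.

    Assumption: one sub-directory per speaker, PCM WAV files only.
    """
    totals = defaultdict(float)
    for wav_path in sorted(Path(root).glob("*/*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            # Duration in seconds = number of frames / sample rate.
            seconds = wav.getnframes() / wav.getframerate()
        # The parent directory name is treated as the speaker ID.
        totals[wav_path.parent.name] += seconds / 3600.0
    return dict(totals)
```

Calling `hours_per_speaker("dataset")` on such a directory would return a dict like `{speaker_name: hours}`, and summing the values gives the total to compare against the roughly 24-hour goal.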

P.S. Any tips are welcome. As I said, I'm quite the novice when it comes to anything related to audio ML.

eginhard (Maintainer) · 2024-10-13T20:19:24Z

1. You can try both.
2. Just fine-tune with all of the data combined.
3. Depends on the data, you'll need to try.
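
One possible way to "combine all of the data" is to merge per-speaker transcript files into a single manifest that also records the speaker. The sketch below is an illustration only. It assumes a hypothetical layout where each speaker folder holds an LJSpeech-style `metadata.csv` with `file_id|text` lines, and the output format (`speaker/file_id|text|speaker`) is an assumption, not the format any particular training recipe requires. Check the dataset formatter of the toolkit you fine-tune with before relying on it.

```python
from pathlib import Path


def combine_metadata(root, out_path):
    """Merge root/<speaker>/metadata.csv files into one pipe-separated manifest.

    Assumption: each input line looks like `file_id|text` (extra fields ignored).
    Returns the number of utterances written.
    """
    rows = []
    for meta in sorted(Path(root).glob("*/metadata.csv")):
        speaker = meta.parent.name  # folder name doubles as the speaker ID
        for line in meta.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue  # skip blank lines
            file_id, text = line.split("|")[:2]
            # Prefix the file ID with the speaker so IDs stay unique after merging.
            rows.append(f"{speaker}/{file_id}|{text}|{speaker}")
    Path(out_path).write_text("\n".join(rows) + "\n", encoding="utf-8")
    return len(rows)
```

The returned count is a quick sanity check that no speaker folder was silently skipped.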
