I have done a lot of training on different self-made datasets (typically around 3 hours of audio across a few thousand .wav files, all at 22050 Hz) using Tacotron, starting each time from a pretrained LJSpeech model with the same hyperparameters and a similar number of steps. I am very confused why, for some datasets, the output audio ends up very clear for many samples - sometimes even indistinguishable from the actual person speaking - while for other datasets the synthesized audio always has choppy aberrations. In all my datasets there is no beginning/ending silence, the transcriptions are all correct, and the datasets have fairly similar phoneme distributions (and similar character-length graphs) according to analyze.py in this repo (thanks for making that, by the way).
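For reference, this is roughly the sanity check I run on each dataset before training: a minimal sketch using librosa (not anything from this repo), with a placeholder dataset path and an arbitrary silence threshold.

```python
import glob
import librosa

# Minimal per-dataset sanity check: sample rate, total duration,
# and untrimmed leading/trailing silence.
wav_paths = glob.glob('my_dataset/wavs/*.wav')  # hypothetical dataset location

total_seconds = 0.0
for path in wav_paths:
    wav, sr = librosa.load(path, sr=None)  # keep the native sample rate
    assert sr == 22050, '%s is %d Hz, expected 22050 Hz' % (path, sr)
    total_seconds += len(wav) / sr

    # Flag clips with silence at the edges (20 dB below peak is an
    # arbitrary threshold; adjust to taste).
    _, (start, end) = librosa.effects.trim(wav, top_db=20)
    leading = start / sr
    trailing = (len(wav) - end) / sr
    if leading > 0.1 or trailing > 0.1:
        print('%s: %.2fs leading / %.2fs trailing silence' % (path, leading, trailing))

print('%d clips, %.1f minutes total' % (len(wav_paths), total_seconds / 60))
```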
To take an example from publicly available datasets: on https://keithito.github.io/audio-samples/ one can hear that the model trained on the Nancy Corpus sounds significantly less robotic and clearer than the model trained on LJ Speech. Likewise, https://syang1993.github.io/gst-tacotron/ has samples from a Tacotron model trained on Blizzard 2013 with extremely good quality compared to any samples I've heard from a Tacotron model trained on LJ Speech, even though the Blizzard 2013 data used there is smaller than LJ Speech. Why might this be?
Any comments appreciated.