BibleTTS is a large, high-quality, open Text-to-Speech dataset with up to 86 hours of aligned, single-speaker, studio-quality 48kHz recordings per language. We release aligned speech and text for six languages spoken in Sub-Saharan Africa, with unaligned data available for four additional languages, all derived from the Biblica open.bible project. The data is released under a commercial-friendly CC-BY-SA license.
The BibleTTS corpus consists of high-quality audio released as 48kHz, 24-bit, mono-channel FLAC files. Recordings for each language feature a single speaker captured under professional, close-microphone studio conditions (i.e., without background noise or echo). BibleTTS is rare among public speech corpora in the volume of data available per speaker and in the audio quality it offers for building TTS models. Furthermore, the corpus covers languages that are under-represented in today's voice technology landscape, both in academia and in industry.
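As a quick sanity check, the audio properties of a downloaded recording can be verified with a few lines of Python using the `soundfile` library; the file path below is a placeholder, not an actual path from the release.

```python
# Minimal sketch: confirm a BibleTTS recording matches the stated format
# (48 kHz, 24-bit, mono FLAC). "hausa_verse.flac" is a placeholder path.
import soundfile as sf

info = sf.info("hausa_verse.flac")
print(info.samplerate)  # expected: 48000
print(info.channels)    # expected: 1 (mono)
print(info.subtype)     # expected: 'PCM_24' (24-bit)
print(info.duration)    # length of the verse in seconds
```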
Our aligned data is publicly available on OpenSLR.
| Language | Unaligned Hours | Aligned Hours | Aligned Verses | Sample |
|---|---|---|---|---|
| Ewe | 100.1 | 86.8 | 24,957 | listen |
| Hausa | 103.2 | 86.6 | 40,603 | listen |
| Kikuyu | 90.6 | -- | -- | -- |
| Lingala | 151.7 | 71.6 | 15,117 | listen |
| Luganda | 110.4 | -- | -- | -- |
| Luo | 80.4 | -- | -- | -- |
| Chichewa | 115.9 | -- | -- | -- |
| Akuapem Twi | 75.7 | 67.1 | 28,238 | listen |
| Asante Twi | 82.6 | 74.9 | 29,021 | listen |
| Yoruba | 93.6 | 33.3 | 10,228 | listen |
All trained models are integrated into Coqui TTS and can be demoed on Hugging Face Spaces:
https://huggingface.co/spaces/coqui/CoquiTTS
All models are end-to-end VITS speech synthesis models trained as described in the paper.
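For a quick start, the released models can also be run directly from Python through the Coqui TTS API. The model identifier below follows Coqui's usual `tts_models/<language>/<dataset>/<architecture>` naming and is an assumption (check `tts --list_models` for the exact registered names); the Hausa sentence is only an illustration.

```python
# Minimal sketch of synthesizing speech with one of the released VITS models
# via the Coqui TTS Python API. The model name is assumed, not confirmed here.
from TTS.api import TTS

tts = TTS(model_name="tts_models/hau/openbible/vits")
tts.tts_to_file(text="Sannu da zuwa.", file_path="hausa_sample.wav")
```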
TTS samples coming soon!
| Language | Model checkpoint | Config file | In-domain sample | Out-of-domain sample |
|---|---|---|---|---|
| Ewe | link | link | listen | listen |
| Hausa | link | link | 1, 2, 3 | listen |
| Kikuyu | -- | -- | -- | -- |
| Lingala | link | link | listen | listen |
| Luganda | -- | -- | -- | -- |
| Luo | -- | -- | -- | -- |
| Chichewa | -- | -- | -- | -- |
| Akuapem Twi | link | link | listen | -- |
| Asante Twi | link | link | listen | listen |
| Yoruba | link | link | listen | listen |
- Segmentation using existing verse timestamps (Sec 4.1.1)
- Forced alignment using pre-trained acoustic models (Sec 4.1.2)
- Forced alignment from scratch (Sec 4.1.3)
- Data-checker code for outlier detection (Sec 4.2); a minimal sketch of this kind of check is given below
- VITS TTS model training with coqui-ai (Sec 5)
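To illustrate the kind of check the data-checker performs, the sketch below flags verses whose speaking rate deviates strongly from the corpus median, a common heuristic for catching bad alignments. This is a simplified stand-in, not the exact method from Sec 4.2, and the `(flac_path, text)` input layout is assumed.

```python
# Illustrative outlier check (not the paper's exact data-checker): flag verses
# whose speaking rate (characters per second) is far from the corpus median.
import statistics
import soundfile as sf

def speaking_rate(flac_path, text):
    """Characters of text per second of audio for one verse."""
    info = sf.info(flac_path)
    return len(text) / info.duration

def find_outliers(verses, tolerance=0.5):
    """verses: list of (flac_path, text) pairs; return pairs whose speaking
    rate deviates from the corpus median by more than tolerance * median."""
    rates = [speaking_rate(path, text) for path, text in verses]
    median = statistics.median(rates)
    return [pair for pair, rate in zip(verses, rates)
            if abs(rate - median) > tolerance * median]
```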