This repo aggregates audio and speech corpora for Yorùbá tasks, much as yoruba-text does for text datasets. The corpora may include aligned text or be entirely unlabeled.
The objective is to give a bird's-eye view of the available Yorùbá audio, along with its metadata and entropy, to inform further data collection and modeling. For example, given a large broadcast news corpus, we might train a self-supervised model on a pretext task to generate speech embeddings for use in ASR/TTS work.
Name | Duration (HH:MM:SS) | Transcribed | Segmented into utterances | Aligned | Source |
---|---|---|---|---|---|
Lagos-NWU | 02:45:17 | ✔️ | ✔️ | ✔️ | North-West University |
OpenSLR86 | 04:01:31 | ✔️ | ✔️ | ✔️ | OpenSLR, Google |
Bíbélì Mímọ́ (NIV) | 93:38:15 | ✔️ | | | Biblica Open Bible |
Bíbélì Mímọ́ (KJV) | | ✔️ | | | Bible.is |
Colloquial Yorùbá | 02:32:29 | ✔️ | | | Audio files, Textbook |
OrisunTV Broadcast News | 81:49:29 | | | | YouTube |
VoxLingua107 | 94:02:45 | | ✔️ | | Post-filtered from YouTube |
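
To get a rough sense of how much audio the table covers in total, the listed durations can be summed. The snippet below is a minimal sketch, not part of the repo: the helper names `to_seconds` and `to_hhmmss` and the `DURATIONS` dictionary are illustrative, the values are copied by hand from the table, and Bíbélì Mímọ́ (KJV) is left out because no duration is listed for it.

```python
def to_seconds(duration: str) -> int:
    """Convert an HH:MM:SS string (fields may be unpadded) to seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hhmmss(total_seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS; hours may exceed 99."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Durations copied by hand from the table above (illustrative only).
# Bíbélì Mímọ́ (KJV) is omitted because the table lists no duration for it.
DURATIONS = {
    "Lagos-NWU": "02:45:17",
    "OpenSLR86": "04:1:31",
    "Bíbélì Mímọ́ (NIV)": "93:38:15",
    "Colloquial Yorùbá": "02:32:29",
    "OrisunTV Broadcast News": "81:49:29",
    "VoxLingua107": "94:2:45",
}

total = sum(to_seconds(d) for d in DURATIONS.values())
print(to_hhmmss(total))  # prints 278:49:46
```

Under these assumptions, the corpora with a listed duration add up to a little under 279 hours of audio.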