Yorùbá Audio

This repo aggregates audio/speech corpora for Yorùbá tasks, much as yoruba-text does for text datasets. The corpora may include aligned text or be entirely unlabeled.

The objective is to have a bird's-eye view of available Yorùbá audio, along with its metadata and entropy, to inform further data collection and modeling. For example, if we see a large broadcast news corpus, we might be interested in training a self-supervised model on a pretext task to generate speech embeddings for use in ASR/TTS work.

Corpora

Name	Size in HH:MM:SS	Transcribed	Segmented in utterances	Aligned	Source
Lagos-NWU	02:45:17	✔️	✔️	✔️	North-West University
OpenSLR86	04:01:31	✔️	✔️	✔️	OpenSLR, Google
Bíbélì Mímọ́ (NIV)	93:38:15	✔️			Biblica Open Bible
Bíbélì Mímọ́ (KJV)		✔️			Bible.is
Colloquial Yorùbá	02:32:29	✔️			Audio files, Textbook
OrisunTV Broadcast News	81:49:29				YouTube
VoxLingua107	94:02:45		✔️		post-filtered from YouTube
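
To get a quick overall picture, the sizes in the table can be summed. The snippet below is a minimal sketch, not part of this repo: the names `to_seconds`, `CORPORA` and `total_hours` are hypothetical, the durations are copied by hand from the table, and the Bíbélì Mímọ́ (KJV) row is left out because it lists no size.

```python
def to_seconds(duration: str) -> int:
    """Convert an 'HH:MM:SS' string to seconds (fields need not be zero-padded)."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (size, transcribed) per corpus, copied by hand from the table above.
# Bíbélì Mímọ́ (KJV) is omitted because the table gives no size for it.
CORPORA = {
    "Lagos-NWU": ("02:45:17", True),
    "OpenSLR86": ("04:01:31", True),
    "Bíbélì Mímọ́ (NIV)": ("93:38:15", True),
    "Colloquial Yorùbá": ("02:32:29", True),
    "OrisunTV Broadcast News": ("81:49:29", False),
    "VoxLingua107": ("94:02:45", False),
}


def total_hours(corpora, transcribed_only=False):
    """Sum corpus sizes in hours, optionally keeping only transcribed corpora."""
    seconds = sum(
        to_seconds(size)
        for size, transcribed in corpora.values()
        if transcribed or not transcribed_only
    )
    return seconds / 3600


if __name__ == "__main__":
    print(f"All listed audio: {total_hours(CORPORA):.1f} h")
    print(f"Transcribed only: {total_hours(CORPORA, transcribed_only=True):.1f} h")
```

Keeping the sizes as strings makes it easy to check the dictionary against the table row by row.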

Niger-Volta-LTI/yoruba-audio