Datasets and scripts for basic natural language and speech processing.
This is not an official Google product.
Directory | Language |
---|---|
af | Afrikaans |
bn | Bengali / Bangla |
hi_ur | Hindi & Urdu |
is | Icelandic |
jv | Javanese |
km | Khmer |
lo | Lao |
my | Burmese / Myanmar |
ne | Nepali |
si | Sinhala |
su | Sundanese |
xh | Xhosa |
zu | Zulu |
This repository includes a few tools for working with the natural language datasets. The tools are written in C++ and Python and are built with Bazel. To compile and use them, install a recent version of Bazel (release 0.4.5 or later is required).
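Before building, it can help to confirm that the installed Bazel meets the minimum release stated above. The following is a minimal sketch, not part of this repository: the helper names `parse_version` and `meets_minimum` are hypothetical, and the code assumes the version is reported as three dot-separated integers somewhere in the input string.

```python
import re

# Minimum Bazel release named in this README.
MINIMUM_BAZEL = (0, 4, 5)


def parse_version(text):
    """Return (major, minor, patch) from the first X.Y.Z found in text.

    Hypothetical helper; assumes a three-part numeric version is present.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MINIMUM_BAZEL):
    """Return True if the version found in text is at least the minimum."""
    # Tuples compare element by element, so (0, 10, 0) > (0, 4, 5).
    return parse_version(text) >= minimum
```

One way to use it is to pass in the version line printed by `bazel version`, for example `meets_minimum("Build label: 0.4.5")`. Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "0.10.0" as older than "0.4.5".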
Resource | Link |
---|---|
Sinhala TTS recordings (~3K) | https://www.openslr.org/30/ |
TTS recordings for four South African languages (af, st, tn, xh) | https://www.openslr.org/32/ |
Large Javanese ASR training data set (~185K) | https://www.openslr.org/35/ |
Large Sundanese ASR training data set (~220K) | https://www.openslr.org/36/ |
High quality TTS data for Bengali languages | https://www.openslr.org/37/ |
High quality TTS data for Javanese | https://www.openslr.org/41/ |
High quality TTS data for Khmer | https://www.openslr.org/42/ |
High quality TTS data for Nepali | https://www.openslr.org/43/ |
High quality TTS data for Sundanese | https://www.openslr.org/44/ |
Large Sinhala ASR training data set | https://www.openslr.org/52/ |
Large Bengali ASR training data set | https://www.openslr.org/53/ |
Large Nepali ASR training data set | https://www.openslr.org/54/ |
Crowdsourced high-quality Argentinian Spanish speech data set | https://www.openslr.org/61/ |
Crowdsourced high-quality Malayalam multi-speaker speech data set | https://www.openslr.org/63/ |
Crowdsourced high-quality Marathi multi-speaker speech data set | https://www.openslr.org/64/ |
Crowdsourced high-quality Tamil multi-speaker speech data set | https://www.openslr.org/65/ |
Crowdsourced high-quality Telugu multi-speaker speech data set | https://www.openslr.org/66/ |
Data set which contains recordings of Catalan | https://www.openslr.org/69 |
Crowdsourced high-quality Nigerian English speech data set | https://www.openslr.org/70 |
Crowdsourced high-quality Chilean Spanish speech data set | https://www.openslr.org/71 |
Crowdsourced high-quality Colombian Spanish speech data set | https://www.openslr.org/72 |
Crowdsourced high-quality Peruvian Spanish speech data set | https://www.openslr.org/73 |
Crowdsourced high-quality Puerto Rico Spanish speech data set | https://www.openslr.org/74 |
Crowdsourced high-quality Venezuelan Spanish speech data set | https://www.openslr.org/75 |
Crowdsourced high-quality Basque speech data set | https://www.openslr.org/76 |
Crowdsourced high-quality Galician speech data set | https://www.openslr.org/77 |
Crowdsourced high-quality Gujarati multi-speaker speech data set | https://www.openslr.org/78 |
Crowdsourced high-quality Kannada multi-speaker speech data set | https://www.openslr.org/79 |
Crowdsourced high-quality Burmese speech data set | https://www.openslr.org/80 |
Data set which contains male and female recordings of English from various dialects of the UK and Ireland | https://www.openslr.org/83 |
Crowdsourced high-quality Yoruba speech data set | https://www.openslr.org/86 |
SLTU 2016 Tutorial - https://sites.google.com/site/sltututorial/overview
- Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
- Open-source Multi-speaker Corpora of the English Accents in the British Isles
- Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
- Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems
- FonBund: A Library for Combining Cross-lingual Phonological Segment Data
- Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
- Rapid development of TTS corpora for four South African languages
- Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.
Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
The directory third_party/ contains third-party works, which we are including under the respective licenses of the upstream projects. See third_party/README.md for further details.