A list of open(ish) corpora for Automatic Speech Recognition research and development.
This list has a preference for free (i.e. no $ cost) and truly open corpora (i.e. some kind of Creative Commons license).
Not all corpora listed here meet those criteria, but all are accessible and usable for research and/or commercial use. Some paid corpora with restrictive licenses (e.g. from the LDC) may be included, given their wide use in research and industry.
Feel free to propose additions to the list!
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
CommonVoice English | English | 582 hours (validated); 803 hours (total) | 33,541 speakers (reported: 10% female / 41% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice German | German | 140 hours (validated); 146 hours (total) | 2,249 speakers (reported: 5% female / 76% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice French | French | 74 hours (validated); 79 hours (total) | 1,697 speakers (reported: 7% female / 72% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Welsh | Welsh | 21 hours (validated); 22 hours (total) | 365 speakers (reported: 26% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Breton | Breton | 2 hours (validated); 7 hours (total) | 82 speakers (reported: 2% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Chuvash | Chuvash | <1 hour (validated); 2 hours (total) | 33 speakers (reported: 0% female / 46% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Turkish | Turkish | 5 hours (validated); 6 hours (total) | 203 speakers (reported: 7% female / 75% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Tatar | Tatar | 20 hours (validated); 20 hours (total) | 117 speakers (reported: 2% female / 80% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Kyrgyz | Kyrgyz | 5 hours (validated); 6 hours (total) | 63 speakers (reported: 6% female / 80% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Irish | Irish | 1 hour (validated); 1 hour (total) | 30 speakers (reported: 22% female / 57% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Kabyle | Kabyle | 92 hours (validated); 98 hours (total) | 382 speakers (reported: 17% female / 53% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Catalan | Catalan | 92 hours (validated); 98 hours (total) | 1,639 speakers (reported: 44% female / 38% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Chinese (Taiwan) | Mandarin (Taiwan) | 19 hours (validated); 28 hours (total) | 695 speakers (reported: 35% female / 38% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Slovenian | Slovenian | 1 hour (validated); 3 hours (total) | 18 speakers (reported: 17% female / 82% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Italian | Italian | 15 hours (validated); 19 hours (total) | 313 speakers (reported: 7% female / 67% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Dutch | Dutch | 12 hours (validated); 13 hours (total) | 373 speakers (reported: 2% female / 74% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Hakha Chin | Hakha Chin | 2 hours (validated); 4 hours (total) | 253 speakers (reported: 22% female / 26% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Esperanto | Esperanto | 4 hours (validated); 6 hours (total) | 53 speakers (reported: 10% female / 21% male) | https://voice.mozilla.org/en/datasets | CC-0 |
Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
Vystadial | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
SWC-2017 | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text) |
TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
VoxForge | English | ~120 hours | ~2966 speakers | http://www.voxforge.org/home/downloads https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |