This repository collects information and datasets for Ukrainian automatic speech recognition (ASR), also known as speech-to-text (STT). It also covers Ukrainian speech synthesis, also known as text-to-speech (TTS).
- Join our Speech Recognition Group in Telegram: https://t.me/speech_recognition_uk
- Join our Speech Synthesis Group in Telegram: https://t.me/speech_synthesis_uk
Alternatively, you can start a discussion in this repository.
You can support our work with a donation:
- via Monobank: https://send.monobank.ua/jar/3Saxixsdua
- on Patreon: https://www.patreon.com/yehor_smoliakov
wav2vec2
- 1B params (with a language model built on a small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-1b-uk-with-lm
- 1B params (with a language model built on news texts): https://huggingface.co/Yehor/wav2vec2-xls-r-1b-uk-with-news-lm
- 1B params (with a binary language model built on news texts): https://huggingface.co/Yehor/wav2vec2-xls-r-1b-uk-with-binary-news-lm
- 1B params (with an OSCAR-based language model): https://huggingface.co/arampacha/wav2vec2-xls-r-1b-uk
- 1B params (with an OSCAR-based language model): https://huggingface.co/arampacha/wav2vec2-xls-r-1b-uk-cv
- 300M params (with a language model built on a small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-lm
- 300M params (without a language model): https://huggingface.co/robinhad/wav2vec2-xls-r-300m-uk
- 300M params (with a small language model built on a small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-small-lm
- 300M params (with a small language model built on a small portion of data), trained on noised data: https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-small-lm-noisy
- 300M params (with a language model built on news texts): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-news-lm
- 300M params (with a language model built on Wikipedia texts): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-wiki-lm
- 90M params (with a language model built on a small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-base-uk-with-small-lm
- 90M params (with a language model built on a small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-base-uk-with-cv-lm
- ONNX model (1B and 300M models): https://github.com/egorsmkv/ukrainian-onnx-model
You can check demos out here: https://github.com/egorsmkv/wav2vec2-uk-demo
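The wav2vec2 checkpoints above are CTC models: they emit per-frame letter predictions that are decoded by collapsing repeats and dropping blank tokens (the optional n-gram LM then rescores candidates). Below is a minimal greedy CTC decode sketch with a toy vocabulary; the real checkpoints ship their own `vocab.json`, so the vocabulary here is purely illustrative.

```python
# Toy vocabulary; real wav2vec2-uk checkpoints define their own vocab.json.
VOCAB = ["<pad>", "к", "і", "т"]  # index 0 acts as the CTC blank token
BLANK_ID = 0

def ctc_greedy_decode(frame_ids):
    """Greedy CTC decode: collapse repeated per-frame argmax ids, drop blanks."""
    collapsed = [frame_ids[0]]
    for i in frame_ids[1:]:
        if i != collapsed[-1]:
            collapsed.append(i)
    return "".join(VOCAB[i] for i in collapsed if i != BLANK_ID)

# Frames predicting: к к <pad> і т т  ->  "кіт"
print(ctc_greedy_decode([1, 1, 0, 2, 3, 3]))  # кіт

# A blank between identical letters keeps them distinct:
print(ctc_greedy_decode([1, 0, 1]))  # кк
```

In a real pipeline, `frame_ids` would come from an argmax over the model's logits, and an LM-based beam-search decoder would replace the greedy step.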
Citrinet
- NVIDIA Streaming Citrinet 1024 (uk): https://huggingface.co/nvidia/stt_uk_citrinet_1024_gamma_0_25
- NVIDIA Streaming Citrinet 512 (uk): https://huggingface.co/neongeckocom/stt_uk_citrinet_512_gamma_0_25
ContextNet
- NVIDIA Streaming ContextNet 512 (uk): https://huggingface.co/theodotus/stt_uk_contextnet_512
FastConformer
- FastConformer Hybrid Transducer-CTC Large P&C: https://huggingface.co/theodotus/stt_ua_fastconformer_hybrid_large_pc
Squeezeformer
- Squeezeformer-CTC ML: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_ml
- Squeezeformer-CTC SM: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_sm
- Squeezeformer-CTC XS: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_xs
Silero
- Silero Models (ua_v3, xxsmall model): see the provided Colab notebooks, examples, and performance benchmarks; the fully optimized/quantized model is ~30 MB without major quality loss
- Silero v1: https://github.com/snakers4/silero-models (demo code: https://github.com/egorsmkv/ua-silero-demo; there is also a demo Telegram bot: https://t.me/ukr_stt_bot)
VOSK
- VOSK v3 nano (with a dynamic graph): https://drive.google.com/file/d/1Pwlxmtz7SPPm1DThBPM3u66nH6-Dsb1n/view?usp=sharing (73 MB)
- VOSK v3 small (with a dynamic graph): https://drive.google.com/file/d/1Zkambkw2hfpLbMmpq2AR04-I7nhyjqtd/view?usp=sharing (133 MB)
- VOSK v3 (with a dynamic graph): https://drive.google.com/file/d/12AdVn-EWFwEJXLzNvM0OB-utSNf7nJ4Q/view?usp=sharing (345 MB)
- VOSK v3: https://drive.google.com/file/d/17umTgQuvvWyUiCJXET1OZ3kWNfywPjW2/view?usp=sharing (343 MB)
- VOSK v2: https://drive.google.com/file/d/1MdlN3JWUe8bpCR9A0irEr-Icc1WiPgZs/view?usp=sharing (339 MB; demo code: https://github.com/egorsmkv/vosk-ukrainian-demo)
- VOSK v1: https://drive.google.com/file/d/1nzpXRd4Gtdi0YVxCFYzqtKKtw_tPZQfK/view?usp=sharing (87 MB; an older model trained on less data)
Note: VOSK models are licensed under Apache License 2.0.
DeepSpeech
- DeepSpeech trained with transfer learning from the English model: https://github.com/robinhad/voice-recognition-ua
- v0.5: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.5 (1230+ hours)
- v0.4: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.4 (1230 hours)
- v0.3: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.3 (751 hours)
whisper
- whisper: https://github.com/openai/whisper
- whisper (small, fine-tuned for Ukrainian): https://github.com/egorsmkv/whisper-ukrainian
- whisper (large, fine-tuned for Ukrainian): https://huggingface.co/arampacha/whisper-large-uk-2
These benchmark tables use the Common Voice 10 test split.
Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
---|---|---|---|---|---|---|
Yehor/wav2vec2-xls-r-1b-uk-with-lm | 0.1807 | 0.0317 | 81.93% | 0.1193 | 0.0218 | 88.07% |
Yehor/wav2vec2-xls-r-1b-uk-with-binary-news-lm | 0.1807 | 0.0317 | 81.93% | 0.0997 | 0.0191 | 90.03% |
Yehor/wav2vec2-xls-r-300m-uk-with-lm | 0.2906 | 0.0548 | 70.94% | 0.172 | 0.0355 | 82.8% |
Yehor/wav2vec2-xls-r-300m-uk-with-news-lm | 0.2027 | 0.0365 | 79.73% | 0.0929 | 0.019 | 90.71% |
Yehor/wav2vec2-xls-r-300m-uk-with-wiki-lm | 0.2027 | 0.0365 | 79.73% | 0.1045 | 0.0208 | 89.55% |
Yehor/wav2vec2-xls-r-base-uk-with-small-lm | 0.4441 | 0.0975 | 55.59% | 0.2878 | 0.0711 | 71.22% |
robinhad/wav2vec2-xls-r-300m-uk | 0.2736 | 0.0537 | 72.64% | - | - | - |
arampacha/wav2vec2-xls-r-1b-uk | 0.1652 | 0.0293 | 83.48% | 0.0945 | 0.0175 | 90.55% |
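For reference, WER is the word-level Levenshtein (edit) distance divided by the number of reference words, CER is the same computed over characters, and the Accuracy column is simply (1 - WER) * 100. A minimal WER implementation, with a toy Ukrainian example:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

# One substituted word out of four -> WER 0.25
print(wer("кіт сидить на столі", "кіт стоїть на столі"))  # 0.25
```

Computed over characters instead of split words, the same distance yields CER. Applied to the first table row: an accuracy of 81.93% corresponds to a WER of 0.1807.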
lm-4gram-500k is used as the LM
Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
---|---|---|---|---|---|---|
nvidia/stt_uk_citrinet_1024_gamma_0_25 | 0.0432 | 0.0094 | 95.68% | 0.0352 | 0.0079 | 96.48% |
neongeckocom/stt_uk_citrinet_512_gamma_0_25 | 0.0746 | 0.016 | 92.54% | 0.0563 | 0.0128 | 94.37% |
Model | WER | CER | Accuracy, % |
---|---|---|---|
theodotus/stt_uk_contextnet_512 | 0.0669 | 0.0145 | 93.31% |
This model supports punctuation and capitalization of the output text.
Model | WER | CER | Accuracy, % | WER+P&C | CER+P&C | Accuracy+P&C, % |
---|---|---|---|---|---|---|
theodotus/stt_ua_fastconformer_hybrid_large_pc | 0.0400 | 0.0102 | 96.00% | 0.0710 | 0.0167 | 92.90% |
lm-4gram-500k is used as the LM
Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
---|---|---|---|---|---|---|
theodotus/stt_uk_squeezeformer_ctc_xs | 0.1078 | 0.0229 | 89.22% | 0.0777 | 0.0174 | 92.23% |
theodotus/stt_uk_squeezeformer_ctc_sm | 0.082 | 0.0175 | 91.8% | 0.0605 | 0.0142 | 93.95% |
theodotus/stt_uk_squeezeformer_ctc_ml | 0.0591 | 0.0126 | 94.09% | 0.0451 | 0.0105 | 95.49% |
lm-4gram-500k is used as the LM
Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
---|---|---|---|---|---|---|
Flashlight Conformer | 0.1915 | 0.0244 | 80.85% | 0.0907 | 0.0198 | 90.93% |
Model | WER | CER | Accuracy, % |
---|---|---|---|
robinhad/data2vec-large-uk | 0.3117 | 0.0731 | 68.83% |
Model | WER | CER | Accuracy, % |
---|---|---|---|
v3 | 0.5325 | 0.3878 | 46.75% |
Model | WER | CER | Accuracy, % |
---|---|---|---|
snakers4/silero-models | 0.2356 | 0.0646 | 76.44% |
Model | WER | CER | Accuracy, % |
---|---|---|---|
speechbrain/m-ctc-t-large | 0.57 | 0.1094 | 43% |
Original Whisper models:
Model | WER | CER | Accuracy, % |
---|---|---|---|
tiny | 0.6308 | 0.1859 | 36.92% |
base | 0.521 | 0.1408 | 47.9% |
small | 0.3057 | 0.0764 | 69.43% |
medium | 0.1873 | 0.044 | 81.27% |
large (v1) | 0.1642 | 0.0393 | 83.58% |
large (v2) | 0.1372 | 0.0318 | 86.28% |
Fine-tuned version for Ukrainian:
Model | WER | CER | Accuracy, % |
---|---|---|---|
small | 0.2704 | 0.0565 | 72.96% |
large | 0.2482 | 0.055 | 75.18% |
If you want to fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian
Model | WER | CER | Accuracy, % |
---|---|---|---|
v0.5 | 0.7025 | 0.2009 | 29.75% |
- How to train your own model using Kaldi (guide in Russian): https://github.com/egorsmkv/speech-recognition-uk/blob/master/vosk-model-creation/INSTRUCTION.md
- How to train a KenLM model based on Ukrainian Wikipedia data: https://github.com/egorsmkv/ukwiki-kenlm
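KenLM expects a corpus of one normalized sentence per line. A hypothetical normalization pass is sketched below; the actual rules used by the ukwiki-kenlm repository may differ, and the character set and apostrophe handling here are assumptions. It lowercases the text and keeps only Ukrainian letters, apostrophes, and spaces:

```python
import re

# Assumed rule set: keep Ukrainian lowercase letters, apostrophe, and space;
# everything else (digits, Latin letters, punctuation) becomes whitespace.
NOT_ALLOWED = re.compile(r"[^а-щьюяґєії' ]+")

def normalize(line: str) -> str:
    """Normalize one corpus line for n-gram LM training (illustrative only)."""
    line = line.lower().replace("’", "'")  # unify apostrophe variants
    line = NOT_ALLOWED.sub(" ", line)
    return " ".join(line.split())          # collapse runs of whitespace

print(normalize("Кам'янець-Подільський, місто!"))  # кам'янець подільський місто
```

The normalized lines would then be fed to KenLM's `lmplz` to estimate the n-gram model.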
- Export a traced JIT version of wav2vec2 models: https://github.com/egorsmkv/wav2vec2-jit
- Storage Share powered by Nextcloud: https://nx16725.your-storageshare.de/s/cAbcBeXtdz7znDN (use wget to download; downloading in a browser is speed-limited)
- Torrent file: https://academictorrents.com/details/fcf8bb60c59e9eb583df003d54ed61776650beb8 (188.31 GB)
- Storage Share powered by Nextcloud: https://nx16725.your-storageshare.de/s/f4NYHXdEw2ykZKa
- Mozilla Common Voice has the Ukrainian dataset: https://commonvoice.mozilla.org/uk/datasets
- M-AILABS Ukrainian Corpus: http://www.caito.de/data/Training/stt_tts/uk_UK.tgz
- Transcriptions: https://www.dropbox.com/s/ohj3y2cq8f4207a/transcriptions.zip?dl=0
- Audio files: https://www.dropbox.com/s/v8crgclt9opbrv1/data.zip?dl=0
- VoxForge Repository: http://www.repository.voxforge1.org/downloads/uk/Trunk/
- Ukrainian ASR corpus collected via a Telegram bot: https://github.com/egorsmkv/asr-tg-bot-corpus
- M-AILABS Speech Dataset (includes Ukrainian): https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
- Ukrainian KenLM language models: https://huggingface.co/Yehor/kenlm-ukrainian
- WFST for Ukrainian Inverse Text Normalization: https://github.com/lociko/ukraine_itn_wfst
- Punctuation and capitalization model: https://huggingface.co/dchaplinsky/punctuation_uk_bert
Test sentence with stress marks:
К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.
Without stress marks:
Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
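In the stressed variant above, `+` is placed before the stressed vowel. Assuming `+` is used only as a stress marker, recovering the plain text is a single string replacement:

```python
stressed = ("К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, "
            "ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної "
            "гром+ади +і Кам'ян+ець-Под+ільського рай+ону.")

# Strip the stress markers to recover the plain sentence
plain = stressed.replace("+", "")
print(plain)
```

The reverse direction (placing stress marks on plain text) requires an accentuation model or dictionary and is not a simple string operation.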
RAD-TTS
- RAD-TTS with the voice "Lada"
- RAD-TTS with three voices: Lada, Tetiana, and Mykyta
Demo video: demo.mp4
Coqui TTS
- v1.0.0, trained on the M-AILABS dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v1.0.0 (200,000 steps)
- v2.0.0, trained on the Mykyta/Olena dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v2.0.0 (140,000 steps)
Demo video: tts_output.mp4
Neon TTS
- A Coqui TTS model implemented in the Neon Coqui TTS Python plugin. An interactive demo is available on Hugging Face; this model and others can be downloaded from Hugging Face, and more information is available at neon.ai.
Demo video: neon_tts.mp4
Balacoon TTS
- Balacoon TTS with the voices of Lada, Tetiana, and Mykyta (a blog post accompanies the model release).
Demo video: balacoon_tts.mp4
- Voice "LADA", female: https://github.com/egorsmkv/ukrainian-tts-datasets/tree/main/lada