VOSK STT Engine #280

Closed
aaronchantrill opened this issue Jun 28, 2020 · 10 comments · May be fixed by #367
Labels
Good First Issue! Hacktoberfest Small or non-core issues that could be worked on by Hacktoberfest participants Priority: Medium Status: In Progress Type: Enhancement

Comments

@aaronchantrill
Contributor

aaronchantrill commented Jun 28, 2020

Detailed Description

VOSK (https://alphacephei.com/vosk/) is a new open-source STT toolkit/engine that is built on Kaldi and optimized to run on Raspberry Pi. Building a language model is described here: https://alphacephei.com/vosk/adaptation.html

Context

Learning to train and adapt the acoustic model, language model, and dictionary is enormously helpful in speech recognition. The more you can narrow the set of words and phrases the recognizer has to choose between, the better the recognition becomes. Naomi has an advantage in that we already have a list of phrases from which a language model can be built directly.

Possible Implementation

VOSK can be installed with a simple pip3 install vosk. The training tools are basically Kaldi, but it is not necessary to install Kaldi to use VOSK. The adaptation page shows a good start on developing a language model from the intent templates.
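As a rough illustration of what using the installed package looks like, here is a minimal sketch that transcribes a WAV file. The function names `transcribe_wav` and `text_from_results` are made up for this example, and it assumes a mono 16-bit PCM WAV file and an already downloaded, unpacked model directory; it is not taken from the Naomi plugin.

```python
import json
import wave


def text_from_results(result_strings):
    """Join the "text" fields of a sequence of Vosk JSON result strings."""
    pieces = []
    for raw in result_strings:
        text = json.loads(raw).get("text", "").strip()
        if text:
            pieces.append(text)
    return " ".join(pieces)


def transcribe_wav(wav_path, model_dir):
    """Transcribe a WAV file (assumed mono, 16-bit PCM) with a Vosk model."""
    # Imported here so the helper above can be used without vosk installed.
    from vosk import KaldiRecognizer, Model

    results = []
    with wave.open(wav_path, "rb") as wav:
        recognizer = KaldiRecognizer(Model(model_dir), wav.getframerate())
        while True:
            data = wav.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance has ended.
            if recognizer.AcceptWaveform(data):
                results.append(recognizer.Result())
        results.append(recognizer.FinalResult())
    return text_from_results(results)
```

Calling `transcribe_wav("test.wav", "vosk-model-en-us-0.22-lgraph")` would return the recognized text as one string, assuming both paths exist.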

@aaronchantrill
Contributor Author

aaronchantrill commented Oct 15, 2022

I am working on this, and the reliability of VOSK is pretty amazing. It is also pretty lightweight and easy to install. I am currently trying to adapt the language model using the instructions at https://alphacephei.com/vosk/lm. As I understand it right now, I need to convert all the speechhandler intent templates into a JSGF file, use that to generate an ARPA statistical model, and then interpolate that with the default VOSK language model. Phonetisaurus works for generating a custom dictionary, and the VOSK compile model (https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-compile.zip) comes with a pre-trained fst file to use with Phonetisaurus for generating new pronunciations.
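The first step in that pipeline, turning phrases into a JSGF file, could be sketched roughly as follows. This assumes the intent templates have already been expanded into plain phrases (no keyword placeholders); the function name `phrases_to_jsgf`, the grammar name, and the `<command>` rule name are all hypothetical, and lowercasing the phrases is my assumption rather than a documented requirement.

```python
def phrases_to_jsgf(phrases, grammar_name="naomi"):
    """Build a flat JSGF grammar with one alternative per unique phrase."""
    alternatives = []
    seen = set()
    for phrase in phrases:
        # Assumption: lowercase and collapse whitespace before comparing.
        normalized = " ".join(phrase.lower().split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            alternatives.append(normalized)
    if not alternatives:
        raise ValueError("at least one non-empty phrase is required")
    body = " | ".join(alternatives)
    return (
        "#JSGF V1.0;\n"
        f"grammar {grammar_name};\n"
        f"public <command> = {body};\n"
    )
```

The returned string can be written to a `.jsgf` file and fed to whatever tool generates the ARPA model from it.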

@Akul2010

Akul2010 commented Jan 8, 2023

I found this github link for making a custom model: https://github.com/matteo-39/vosk-build-model

@aaronchantrill
Contributor Author

@Akul2010 Sorry, I meant to get back to you earlier. That is a very interesting set of instructions for building a VOSK model, but overkill for anything we'd be doing. Using those instructions, you could add a whole new language to VOSK, which is awesome.

We just need to customize the Gr.fst and HCLr.fst with custom words and phrases. The process is described here: https://alphacephei.com/vosk/lm. It supports English, French, German, and Russian and is pretty straightforward, but it requires installing both Kaldi and SRILM.

Kaldi is usually pretty easy to install, although the last time I installed it on a new Bookworm system I had to trick the installer into thinking that I had Python 2.7 installed. The install process still thinks it needs Python 2.7, which is no longer available through my package manager. I see some discussion that Python 2.7 was really only required for Pocketsphinx, which has also updated to Python 3, so hopefully Kaldi will drop that requirement soon.

SRILM is open-source and available for academic and government use, but it is not freely available; you have to register an account to download it. On my Raspberry Pi I had to trick it into compiling on aarch64 by modifying the make files as described here: https://github.com/G10DRAS/SRILM-on-RaspberryPi

There are other, freer libraries that can be used instead of SRILM, including KenLM, which is very lightweight and which we are already using to build language models for Pocketsphinx (although with a much smaller vocabulary). I'm not sure about the process of converting a language model file to fst format, though.

Overall, getting the Raspberry Pi set up is not simple, but once it is set up, all you have to do is drop your vocabulary into the db/extra.txt file, run compile-graph.sh, and wait for it to finish so you can pick up your new vocabulary G.fst and HCL.fst files from exp/chain/tdnn/graph.
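Those two steps could be scripted along these lines. This is only a sketch: the helper names are hypothetical, it assumes `compile_dir` is the unpacked compile package with Kaldi and SRILM already set up, that compile-graph.sh is run from that directory without arguments, and that one lowercased phrase per line is an acceptable format for db/extra.txt.

```python
import subprocess
from pathlib import Path


def write_extra_vocabulary(compile_dir, phrases):
    """Write unique phrases, one per line, to db/extra.txt under compile_dir.

    Note: this overwrites any existing db/extra.txt.
    """
    extra_path = Path(compile_dir) / "db" / "extra.txt"
    extra_path.parent.mkdir(parents=True, exist_ok=True)
    unique = []
    seen = set()
    for phrase in phrases:
        # Assumption: lowercase and collapse whitespace before writing.
        normalized = " ".join(phrase.lower().split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    extra_path.write_text(
        "".join(line + "\n" for line in unique), encoding="utf-8"
    )
    return extra_path


def compile_graph(compile_dir):
    """Run compile-graph.sh in compile_dir; raises if the script fails."""
    # Assumption: the script takes no arguments and is run from compile_dir.
    subprocess.run(["bash", "compile-graph.sh"], cwd=compile_dir, check=True)
```

Given the memory issues described in this thread, the `compile_graph` step may take a long time or fail on machines with little RAM.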

The last time I tried this, I tried it on a couple of computers and kept running into memory issues. I finally got it working under WSL on a Windows machine with 32 GiB of RAM. I'm getting ready to try again with my 8 GiB Raspberry Pi 5.

@aaronchantrill
Contributor Author

The Raspberry Pi 5 was able to do it! It did cut off all communication for a little while, and I'm not sure how long it took, but it was able to build an HCLG.fst file, which I am using now and which does recognize my custom vocabulary.

@Akul2010

Akul2010 commented Mar 5, 2024

Great! Do you plan on making it available in maybe the next few builds on Naomi?

@aaronchantrill
Contributor Author

@Akul2010 I think what makes sense for now is just to write up the steps required for generating a custom vocabulary and put a check in place that notifies you if any words in the current "languagemodel" file (i.e., ~/.config/naomi/vocabularies/en-US/VOSK STT/default/languagemodel) do not also appear in the VOSK words.txt file (i.e., ~/.config/naomi/vosk/vosk-model-en-us-0.22-lgraph/graph/words.txt).
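A check like that could look roughly like the sketch below. It is not the plugin's actual code. It relies on the Kaldi words.txt convention of one `word id` pair per line, and it assumes (this is a guess about the file format) that the "languagemodel" file is plain text with whitespace-separated words; the function names are hypothetical.

```python
def load_known_words(words_lines):
    """Collect the word column from Kaldi-style words.txt lines ("word id")."""
    known = set()
    for line in words_lines:
        parts = line.split()
        if parts:
            known.add(parts[0])
    return known


def find_unknown_words(vocabulary_lines, known_words):
    """Return a sorted list of vocabulary words missing from known_words."""
    unknown = set()
    for line in vocabulary_lines:
        # Assumption: the model's words are lowercase, so compare lowercase.
        for word in line.lower().split():
            if word not in known_words:
                unknown.add(word)
    return sorted(unknown)
```

With both files opened as text, `find_unknown_words(vocab_file, load_known_words(words_file))` gives the list of words to warn about; an empty list means every word is already in the model.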

@aaronchantrill
Contributor Author

It would be good to see if we can use KenLM to generate a language model and then convert that to an HCLG.fst file. I'm really not comfortable requiring people to go out and register with SRI so they can download a copy of SRILM.

@aaronchantrill
Contributor Author

It is currently available from https://github.com/aaronchantrill/Naomi_VOSK_STT but I haven't added it to NPE yet because it's so difficult to customize the vocabulary. I think it will be enough if I add a check that warns the user when their vocabulary uses any words that Vosk does not currently know, with a link to a detailed description of how to generate a custom Vosk vocabulary. With that in place, I will feel good about adding it to the NPE.

@aaronchantrill
Contributor Author

@Akul2010 I have updated the Naomi_VOSK_STT plugin at https://github.com/aaronchantrill/Naomi_VOSK_STT. It still doesn't do the language model adaptation automatically, but it does give you warnings if there are any words in your vocabulary that it doesn't know. I added a credit at the bottom for you since we never managed to get your pull request merged. Thanks! I'll be submitting this plugin to NPE later today, and will be recording a new "How to install Naomi" video soon.

@Akul2010

Great! Thank you!
