VOSK STT Engine #280
Comments
I am working on this, and the reliability of VOSK is pretty amazing. It is also pretty lightweight and easy to install. I am currently trying to adapt the language model using the instructions at https://alphacephei.com/vosk/lm. As I understand it so far, I need to convert all the speechhandler intent templates into a JSGF file, use that to generate an ARPA statistical model, and then interpolate that model with the default VOSK language model. Phonetisaurus works for generating a custom dictionary, and the VOSK compile model (https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-compile.zip) comes with a pre-trained FST file to use with Phonetisaurus for generating new pronunciations. |
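As a rough illustration of the first step, here is a minimal sketch that expands intent templates into plain sentences, one per line, which is the kind of text corpus a statistical language model tool can consume. It skips JSGF entirely, and the template format (`{SlotName}` placeholders filled from a keyword dictionary) and the names `expand_template` and `build_corpus` are assumptions made for the example, not Naomi's actual API. Lowercasing is also an assumption about what the English VOSK model expects.

```python
import itertools
import re

# Hypothetical slot syntax: "TURN {State} THE {Device}"
SLOT = re.compile(r"\{(\w+)\}")


def expand_template(template, keywords):
    """Yield every sentence produced by filling each slot with each keyword."""
    slots = SLOT.findall(template)
    if not slots:
        yield template.lower()
        return
    choices = [keywords[name] for name in slots]
    for combo in itertools.product(*choices):
        values = iter(combo)
        # Replace the slots left to right with the values of this combination.
        yield SLOT.sub(lambda match: next(values), template).lower()


def build_corpus(templates, keywords):
    """Return the expanded sentences in order, without duplicates."""
    sentences = {}
    for template in templates:
        for sentence in expand_template(template, keywords):
            sentences.setdefault(sentence, None)
    return list(sentences)


if __name__ == "__main__":
    example_keywords = {"State": ["ON", "OFF"], "Device": ["LIGHT", "FAN"]}
    example_templates = ["WHAT TIME IS IT", "TURN {State} THE {Device}"]
    print("\n".join(build_corpus(example_templates, example_keywords)))
```

The number of sentences grows with the product of the keyword list sizes, so templates with several large slots may need sampling rather than full expansion.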
I found this GitHub link for making a custom model: https://github.com/matteo-39/vosk-build-model |
@Akul2010 Sorry, I meant to get back to you earlier. That is a very interesting set of instructions for building a VOSK model, but it is overkill for anything we'd be doing. Using those instructions, you could add a whole new language to VOSK, which is awesome. We just need to customize the Gr.fst and HCLr.fst with custom words and phrases. The process is described here: https://alphacephei.com/vosk/lm. It supports English, French, German, and Russian and is pretty straightforward, but it requires installing both Kaldi and SRILM.
Kaldi is usually pretty easy to install, although the last time I installed it on a new Bookworm system I had to trick the installer into thinking that I had Python 2.7 installed. The installer still thinks it needs Python 2.7 for the install process, but that version is no longer available through my package manager. I see some discussion that Python 2.7 was really only required for Pocketsphinx, which has also updated to Python 3, so hopefully Kaldi will drop that requirement soon.
SRILM is open-source and available for academic and government use, but it is not freely available: you have to register an account to download it. On my Raspberry Pi I had to trick it into compiling on aarch64 by modifying make files as described here: https://github.com/G10DRAS/SRILM-on-RaspberryPi. There are other, more freely available libraries that can be used instead of SRILM, including KenLM, which is very lightweight and which we are already using to build language models for Pocketsphinx (although with a much smaller vocabulary). I'm not sure about the process of converting a language model file to fst format, though.
Overall, the process of getting the Raspberry Pi set up is not simple, but once it is set up, all you have to do is drop your vocabulary into the db/extra.txt file, run compile-graph.sh, and wait for it to finish so you can pick up your new vocabulary G.fst and HCL.fst files from exp/chain/tdnn/graph.
The last time I tried this, I tried it on a couple of computers and kept running into memory issues. I finally got it working under WSL on a Windows machine with 32 GiB of RAM. I'm getting ready to try again with my 8 GiB Raspberry Pi 5. |
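Once Kaldi and SRILM are installed, the final step described above could be scripted roughly as follows. This is a sketch under assumptions: `compile_dir` stands for wherever the unpacked VOSK compile package lives, the helper names are made up for the example, the phrases are lowercased on the assumption that the English model expects lowercase text, and the existing db/extra.txt is overwritten rather than appended to.

```python
import subprocess
from pathlib import Path


def write_extra_text(compile_dir, phrases):
    """Write one lowercased phrase per line to db/extra.txt (overwrites it)."""
    extra = Path(compile_dir) / "db" / "extra.txt"
    extra.parent.mkdir(parents=True, exist_ok=True)
    lines = [phrase.strip().lower() + "\n" for phrase in phrases]
    extra.write_text("".join(lines), encoding="utf-8")
    return extra


def compile_graph(compile_dir):
    """Run compile-graph.sh from inside the compile package directory.

    This can take a long time and a lot of memory. Returns the directory
    the thread above says the rebuilt graph files end up in.
    """
    subprocess.run(["bash", "compile-graph.sh"], cwd=compile_dir, check=True)
    return Path(compile_dir) / "exp" / "chain" / "tdnn" / "graph"


if __name__ == "__main__":
    # Hypothetical location of the unpacked compile package.
    package = Path.home() / "vosk-model-en-us-0.22-compile"
    write_extra_text(package, ["turn on the light", "what time is it"])
    print(compile_graph(package))
```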
The Raspberry Pi 5 was able to do it! It did cut off all communication for a little while, and I'm not sure how long the build took, but it was able to build an HCLG.fst file, which I am using now and which does recognize my custom vocabulary. |
Great! Do you plan on making it available in one of the next few builds of Naomi? |
@Akul2010 I think what makes sense for now is just to write up the steps required for generating a custom vocabulary and put a check in place that notifies you if there are any words in the current "languagemodel" file (i.e., ~/.config/naomi/vocabularies/en-US/VOSK STT/default/languagemodel) that do not also appear in the VOSK words.txt file (i.e., ~/.config/naomi/vosk/vosk-model-en-us-0.22-lgraph/graph/words.txt). |
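A minimal sketch of such a check is shown below. It assumes words.txt follows the usual Kaldi symbol-table layout (one `word id` pair per line) and, to avoid guessing at the format of the "languagemodel" file, takes the vocabulary as a plain list of phrases. The function names are hypothetical and this is not the code used in the plugin.

```python
def load_known_words(words_txt_path):
    """Collect the first column of a Kaldi-style words.txt symbol table."""
    known = set()
    with open(words_txt_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if parts:
                known.add(parts[0])
    return known


def find_unknown_words(phrases, known_words):
    """Return the sorted words used in phrases that the model does not list."""
    unknown = set()
    for phrase in phrases:
        # Assumption: the model's word list is lowercase.
        for word in phrase.lower().split():
            if word not in known_words:
                unknown.add(word)
    return sorted(unknown)


def warn_about_unknown_words(phrases, words_txt_path):
    """Print one warning per unknown word and return the list of them."""
    missing = find_unknown_words(phrases, load_known_words(words_txt_path))
    for word in missing:
        print(f"Warning: '{word}' is not in the VOSK vocabulary")
    return missing
```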
It would be good to see if we can use KenLM to generate a language model and then convert that to an HCLG.fst file. I'm really not comfortable requiring people to go out and register with SRI so they can download a copy of SRILM. |
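For the first half of that idea, a sketch of driving KenLM from Python might look like this. It assumes KenLM's `lmplz` binary is on the PATH and that the corpus is a text file with one sentence per line; `--discount_fallback` is included because small corpora often cannot support the default discount estimation. The helper names are made up for the example, and converting the resulting ARPA file into an HCLG.fst is not covered, since that step is still an open question here.

```python
import subprocess


def lmplz_command(order=3, lmplz="lmplz"):
    """Build the lmplz command line for an n-gram model of the given order."""
    return [lmplz, "-o", str(order), "--discount_fallback"]


def build_arpa(corpus_path, arpa_path, order=3):
    """Run lmplz, reading the corpus on stdin and writing ARPA to stdout."""
    with open(corpus_path, "rb") as source, open(arpa_path, "wb") as target:
        subprocess.run(
            lmplz_command(order), stdin=source, stdout=target, check=True
        )
    return arpa_path
```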
It is currently available from https://github.com/aaronchantrill/Naomi_VOSK_STT, but I haven't added it to NPE yet because it's so difficult to customize the vocabulary. If I add a check that warns the user when their vocabulary contains words that VOSK does not currently know, along with a link to a detailed description of how to generate a custom VOSK vocabulary, I think that will be enough for me to feel good about adding it to the NPE. |
@Akul2010 I have updated the Naomi_VOSK_STT plugin at https://github.com/aaronchantrill/Naomi_VOSK_STT. It still doesn't do the language model adaptation automatically, but it does give you warnings if there are any words in your vocabulary that it doesn't know. I added a credit at the bottom for you since we never managed to get your pull request merged. Thanks! I'll be submitting this plugin to NPE later today, and will be recording a new "How to install Naomi" video soon. |
Great! Thank you! |
Detailed Description
VOSK (https://alphacephei.com/vosk/) is a new open-source STT toolkit/engine that is built on Kaldi and optimized to run on Raspberry Pi. Building a language model is described here: https://alphacephei.com/vosk/adaptation.html
Context
Learning to train and adapt the acoustic model, language model and dictionary is enormously helpful in speech recognition. The more you can narrow down what the recognizer has to choose between, the better the recognition becomes. Naomi has an advantage here: we already have a list of phrases that a language model can be built from directly.
Possible Implementation
VOSK can be installed with a simple `pip3 install vosk`. The training tools are basically Kaldi, but it is not necessary to install Kaldi to use VOSK. The adaptation page shows a good start on developing a language model from the intent templates.
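To give a sense of what using the installed package looks like, here is a minimal transcription sketch. It assumes a VOSK model has already been downloaded and unpacked (the `model_path` argument is a placeholder) and that the input is a mono 16-bit PCM WAV file. The helper names are made up for the example, and `vosk` is imported inside the function so the JSON helper can be used without it.

```python
import json
import wave


def extract_text(result_json):
    """Pull the recognized text out of a VOSK JSON result string."""
    return json.loads(result_json).get("text", "")


def transcribe_wav(wav_path, model_path):
    """Transcribe a mono 16-bit PCM WAV file with a local VOSK model."""
    from vosk import KaldiRecognizer, Model

    pieces = []
    with wave.open(wav_path, "rb") as audio:
        recognizer = KaldiRecognizer(Model(model_path), audio.getframerate())
        while True:
            data = audio.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance has been completed.
            if recognizer.AcceptWaveform(data):
                pieces.append(extract_text(recognizer.Result()))
        pieces.append(extract_text(recognizer.FinalResult()))
    return " ".join(piece for piece in pieces if piece)
```

Calling `transcribe_wav("test.wav", "/path/to/model")` with a real model directory would return the recognized text as a single string.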