- EP 1 is for Petoi with a USB adapter and a MacBook Pro: EP 1 Video Link
- EP 2 is for Petoi with a Raspberry Pi: EP 2 Video Link
Petoi's Bittle is a palm-sized, open-source, programmable robot dog for STEM and fun. Bittle can connect to a Raspberry Pi and can be easily extended. I did this project during my internship at Petoi. My goal was to develop a real-time voice control module for Bittle and use it to command Bittle to perform actions.
In the end, I settled on VAD (Voice Activity Detection) + DTW + Vosk.
I used PyAudio at first, but it is an old library, so I switched to sounddevice and soundfile.
#### Command/Keyword Recognition
From a functional point of view, the methods fall into two categories:
- Speech to text, then look up the commands in the transcript. The advantage is that this can be combined with NLP applications, but full speech-to-text is overkill for recognizing a handful of commands.
- Analyze acoustic features directly to detect commands.
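
The first category (transcribe, then look up commands in the text) can be sketched as below. This is a minimal illustration, not the project's code: the `COMMANDS` table, its action tokens, and the helper names `normalize` and `find_command` are all hypothetical, and the transcript is assumed to arrive as a plain string from whatever speech-to-text engine is used.

```python
import re

# Hypothetical command table: spoken phrase -> action token.
# The phrases are the English test commands from this README; the tokens are made up.
COMMANDS = {
    "stand up": "stand",
    "walk forward": "walk_forward",
    "run forward": "run_forward",
}


def normalize(text):
    """Lowercase the transcript and collapse every non-letter run into one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def find_command(transcript, commands=COMMANDS):
    """Return the action of the earliest command phrase in the transcript, or None."""
    # Pad with spaces so that only whole-word matches count.
    padded = f" {normalize(transcript)} "
    best = None
    for phrase, action in commands.items():
        pos = padded.find(f" {phrase} ")
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, action)
    return best[1] if best else None
```

For example, `find_command("Hey, please stand up now")` picks out the "stand up" entry, while a transcript with no known phrase yields `None`.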
DTW (Dynamic Time Warping) (Used)
This belongs to the second category and is similar to template matching. DTW calculates the cost of matching a piece of audio against a template audio, and we pick the template with the lowest cost. This method needs no training, so it still applies when you add new commands. The drawback is that the calculation is time-consuming. However, the command recordings are short, and we can reduce the work further by eliminating silence and extracting MFCC (Mel-frequency Cepstral Coefficients) features.
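
The template matching described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: each recording is assumed to be already converted into a list of feature vectors (MFCC frames in practice, toy one-dimensional values here), the names `dtw_cost` and `best_template` are hypothetical, and dividing by the combined length is one common normalization that may differ from the `DTW.normalizedDistance` value the project prints.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(seq_a, seq_b):
    """Classic O(len(a) * len(b)) DTW cost, normalized by len(a) + len(b)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def best_template(query, templates):
    """Return (name of the lowest-cost template, dict of all costs)."""
    costs = {name: dtw_cost(query, frames) for name, frames in templates.items()}
    return min(costs, key=costs.get), costs


# Toy data: the query is a time-stretched copy of the "stand up" template.
templates = {
    "stand up": [[0.0], [1.0], [2.0], [1.0], [0.0]],
    "walk forward": [[5.0], [5.0], [4.0], [5.0]],
}
query = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]]
name, costs = best_template(query, templates)
```

Because DTW may repeat frames on either side, the stretched query still aligns with its template at zero cost, which is what makes the method tolerant of speaking speed.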
CNN for Command/Key Word Recognition
PyTorch's official tutorial, Speech Command Recognition with torchaudio, is a demo of this approach. However, the model has to be re-trained whenever new commands are added.
I was inspired by the blog post Audio Handling Basics: Process Audio Files In Command-Line or Python (Hacker Noon). It mentions that the silent parts of a recording can be removed based on the short-term energy of the audio data. The Python library librosa provides functions for doing that.
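
The idea of trimming silence by short-term energy can be sketched without any dependencies as below. This is a rough illustration rather than the project's code: the helper names, the frame length of 160 samples (10 ms at 16 kHz), and the energy threshold are assumptions chosen for the example, and samples are assumed to be floats in [-1, 1]. In practice, librosa's `librosa.effects.trim` offers a decibel-based version of the same idea.

```python
def short_term_energy(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def trim_silence(samples, frame_len=160, threshold=1e-3):
    """Keep the span from the first to the last frame whose energy exceeds threshold."""
    energy = short_term_energy(samples, frame_len)
    voiced = [i for i, e in enumerate(energy) if e > threshold]
    if not voiced:
        return []  # nothing above the threshold: treat the whole clip as silence
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return samples[start:end]


# Toy signal: 320 silent samples, 320 "voiced" samples, 480 silent samples.
signal = [0.0] * 320 + [0.5, -0.5] * 160 + [0.0] * 480
trimmed = trim_silence(signal)
```

Only leading and trailing silence is removed here. Pauses between words are kept, since shortening them can change what a recognizer hears.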
I tried several open-source options:
Offline recognition; provides lightweight tflite models for low-resource devices.
Requires 16-bit, 16 kHz, mono audio. A new version supports Chinese.
I tested it on both unstripped and stripped recordings with both the large and the small models, but it did not do well. For example:
- 起立 (stand up) -> 嘶力/成立
- 向前跑 (run forward) -> 睡前跑
- 向前走 (walk forward) -> 当前走
So I tested it again using English:
- Hey Bittle
- Stand up
- Walk forward
- Run forward
I have used 16 recordings so far. The result is empty when the audio contains OOV (out-of-vocabulary) words. "Bittle" would be recognized as "be to". After silence elimination, some results changed from wrong to correct and others from correct to wrong (possibly because trimming shortens the silence between words).
Of the 16 English tests, 9 were correct; of the 16 Chinese tests, 3 were correct.
It does not have lightweight models, and its models are nearly 900 MB, which is too big for a Raspberry Pi.
It offers multiple backends, such as the Google and Microsoft APIs. Its only offline recognition method is no longer maintained.
- alphacep/vosk (Used)
Vosk provides offline recognition and lightweight models for both English and Chinese. The documentation is incomplete.
Test results for the Chinese model:

| Unstripped correct | Stripped correct | Total correct |
|---|---|---|
| 16/21 | 16/21 | 32/42 |
- Create a virtual environment on your PC (NOT the Pi) with `python==3.7.3`, then activate it.
- Install `portaudio`:
  - Win/Mac/Linux: `conda install portaudio`
  - Or, in the Mac Terminal: `brew install portaudio`
- Install the remaining dependencies: `pip install -r requirements.txt`
- Download the Vosk models. Download both `vosk-model-small-en-us` and `vosk-model-small-cn` for future use. The former is for English and the latter is for Chinese. Both are small models.
- The English model is the default for now. After you download and extract the model ZIP file, put the folder into `./models` and make sure the folder name matches `vosk_model_path` in your config.
- In a terminal (or cmd), `cd` into the `my_vosk` folder and run `python vosk_microphone_pi.py`.
- You can skip the recording step. The pre-recorded template is saved as `./recordings/template_1.wav`; the file with `raw` in its name is the version that has not been stripped.
- Fine-tune the `threshold` for wake-up recognition. In `config.yml`, the value is currently 0 for easy debugging. After "开始监听 (start listening)" appears, check the console for `DTW.normalizedDistance`. For example:
  - Before you say the wake-up word, the value is between 115 and 125.
  - After you say the wake-up word, the value should decrease, perhaps to between 105 and 117.

  In this case, you can set `threshold` to 120 (strict) or 115 (not so strict).
- After waking up Petoi, you enter command recognition. Many commands are pre-defined in `cmd_lookup.py`. For example, you can say "stand up".
- If you want to use the Chinese model (other languages are similar):
  - Unzip the Chinese model and put the folder into `./models`. Set `vosk_model_path` in `config.yml`.
  - Set `cmd_table` in `config.yml` as below:

        cmd_table:
          package: my_vosk.common.cmd_lookup
          table_name: cmd_table_cn
          build_dict: build_dict_cn
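
The threshold tuning step above can be reasoned about with a small helper like the one below. This is a sketch under a loudly stated assumption: it treats the wake word as detected when `DTW.normalizedDistance` falls below `threshold`, and the comparison the project actually uses is not shown in this README. The sample distances are illustrative values picked from the ranges quoted above, not measurements, and `is_wakeup` and `evaluate_threshold` are hypothetical names.

```python
def is_wakeup(distance, threshold):
    """ASSUMED rule: detect the wake word when the distance drops below the threshold."""
    return distance < threshold


def evaluate_threshold(idle, spoken, threshold):
    """Return (false_alarm_rate, hit_rate) for a candidate threshold.

    idle:   distances logged while nobody says the wake word
    spoken: distances logged right after the wake word is said
    """
    false_alarms = sum(is_wakeup(d, threshold) for d in idle)
    hits = sum(is_wakeup(d, threshold) for d in spoken)
    return false_alarms / len(idle), hits / len(spoken)


# Illustrative values only, chosen inside the ranges quoted in this README.
idle = [115, 118, 121, 123, 125]
spoken = [105, 108, 111, 114, 117]

for candidate in (0, 115, 120):
    print(candidate, evaluate_threshold(idle, spoken, candidate))
```

Under this assumed rule, a threshold of 0 never triggers, which fits using it only to watch the logged distances while debugging. Because the two ranges overlap, any candidate value trades missed wake-ups against false alarms, so it helps to log real distances from your own microphone before settling on a number.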