Training your own wake word

patrickxia edited this page Jan 1, 2021 · 23 revisions

How to train your own wake word

Precise comes with a few executables used to train and test models. First, follow the Source Install procedure in the README. Once installed, activate the virtual environment to make these executables available in the current terminal session:

source .venv/bin/activate

Here's a summary of all the executables:

  • precise-collect - Record audio samples for use with Precise
  • precise-convert - Convert wake-word model from Keras to TensorFlow
  • precise-eval - Evaluate a list of models on a dataset
  • precise-listen - Run a model on microphone audio input
  • precise-engine - Run a model on raw audio data from stdin
  • precise-test - Test a model against a dataset
  • precise-train - Train a new model on a dataset
  • precise-train-incremental - Train a model to inhibit activation by marking false activations and retraining

For more info on each individual script, you can run <script-name> -h.

Overview

The rough process for training a model is as follows:

  1. precise-collect - Record wake word samples
  2. precise-train - Initial training
  3. precise-train-incremental - Reduce false activations
  4. precise-test - Statistics on dataset accuracy
  5. precise-listen - Real world test with your microphone
  6. precise-convert - Convert .net to .pb

Recording Samples

The first thing you'll want to do is record some audio samples of your wake word. To do that, use the precise-collect tool, which will guide you through recording a few samples. The default settings should be fine.

Use this tool to collect around 12 samples, making sure to leave a second or two of silence at the start of each recording, but with no silence after the wake word.

$ precise-collect
Audio name (Ex. recording-##): hey-computer.##
ALSA lib pcm_dsnoop.c:638:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
Press space to record (esc to exit)...
Recording...
Saved as hey-computer.00.wav
Press space to record (esc to exit)...

Audio files from precise-collect will be WAV files in little-endian, 16-bit, mono, 16,000 Hz PCM format. FFmpeg calls this “pcm_s16le”. If you collect samples with another program, convert them to this format with an ffmpeg command such as:

$ ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 output.wav
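To confirm that a recording matches the format described above before training on it, a check along these lines may help. This is a minimal sketch using Python's standard wave module; the function name is_precise_wav is made up for illustration and is not part of Precise.

```python
import wave


def is_precise_wav(path):
    """Return True if the file is a 16-bit, mono, 16,000 Hz PCM WAV file."""
    try:
        with wave.open(str(path), "rb") as wav:
            return (
                wav.getnchannels() == 1            # mono
                and wav.getsampwidth() == 2        # 16 bit = 2 bytes per sample
                and wav.getframerate() == 16000    # 16,000 Hz sample rate
                and wav.getcomptype() == "NONE"    # uncompressed PCM
            )
    except (wave.Error, EOFError):
        # Not a WAV file that the standard library can parse
        return False
```

For example, is_precise_wav("hey-computer.00.wav") should return True for a file recorded by precise-collect, and False for a stereo or 44,100 Hz file that still needs the ffmpeg conversion.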

Now, place most of these files under hey-computer/wake-word/ and the rest under hey-computer/test/wake-word:

hey-computer/
├── wake-word/
│   ├── hey-computer.00.wav
│   ├── hey-computer.01.wav
│   ├── hey-computer.02.wav
│   ├── hey-computer.03.wav
│   ├── hey-computer.04.wav
│   ├── hey-computer.05.wav
│   ├── hey-computer.06.wav
│   ├── hey-computer.07.wav
│   └── hey-computer.08.wav
├── not-wake-word/
└── test/
    ├── wake-word/
    │   ├── hey-computer.09.wav
    │   ├── hey-computer.10.wav
    │   ├── hey-computer.11.wav
    │   └── hey-computer.12.wav
    └── not-wake-word/

This tells Precise to train on the first 9 samples and evaluate the model's accuracy using the last 4.
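If you would rather not move the files by hand, the folder layout shown above could be produced with a short script. This is only a sketch: split_samples and its n_test parameter are hypothetical names, not part of Precise, and it assumes that sorting the file names puts the recordings in the order you want.

```python
import shutil
from pathlib import Path


def split_samples(recordings, dataset_dir, n_test=4):
    """Move the last n_test recordings to test/wake-word/, the rest to wake-word/."""
    dataset = Path(dataset_dir)
    train_dir = dataset / "wake-word"
    test_dir = dataset / "test" / "wake-word"
    # Create the full layout, including the (initially empty) not-wake-word folders
    for folder in (train_dir, test_dir,
                   dataset / "not-wake-word",
                   dataset / "test" / "not-wake-word"):
        folder.mkdir(parents=True, exist_ok=True)

    files = sorted(Path(p) for p in recordings)
    split = max(len(files) - n_test, 0)   # index of the first test sample
    for i, wav in enumerate(files):
        target = train_dir if i < split else test_dir
        shutil.move(str(wav), str(target / wav.name))
    return split, len(files) - split      # (training count, test count)
```

Calling split_samples(sorted(Path(".").glob("hey-computer.*.wav")), "hey-computer") on the 13 recordings in the tree above would leave 9 files in wake-word/ and 4 in test/wake-word/.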

Initial Training

Now, we can start to train a model with the precise-train tool:

$ precise-train -e 60 hey-computer.net hey-computer/
...
Epoch 1/60
2018-02-23 11:32:05.235740: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
9/9 [==============================] - 0s 31ms/step - loss: 0.2620 - acc: 0.3333 - val_loss: 0.0836 - val_acc: 0.5000
...
Epoch 60/60
9/9 [==============================] - 0s 1ms/step - loss: 0.0025 - acc: 1.0000 - val_loss: 5.6518e-05 - val_acc: 1.0000

Demoing the Model

Now, we can run this model against live microphone input using precise-listen. It will listen to the microphone and output confidence bars. Each line represents one measurement: the more Xs there are, the more confident the model is that the wake word was uttered. Any Xs over the threshold are denoted with a lowercase x.

$ precise-listen hey-computer.net
Using TensorFlow backend.
2018-02-23 12:46:22.622717: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm_route.c:867:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
XXX-----------------------------------------------------------------------------
XXXXXXXX------------------------------------------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx---------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxx------------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxx-------------------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxx---------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxxxx-------------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxxxxxx-----------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxxxxxxxxxxxxxxxxxxxxxxxx---------------
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxx--------------------------------------
XXXXXXXXXXXXX-------------------------------------------------------------------
XX------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

As you can see, the model has simply learned to activate on every noise, rather than specifically on our wake word. What we need to do now is reduce false activations by incorporating some data that the model should not activate on.
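To make the confidence bars in the precise-listen output above easier to read, here is a small sketch of how one confidence value could be turned into such a line. It is an illustration only, not Precise's actual rendering code: the function name confidence_bar is hypothetical, and the default width of 80 and threshold of 0.5 are guesses read off the sample output.

```python
def confidence_bar(confidence, threshold=0.5, width=80):
    """Render a confidence in [0, 1] as a bar of 'X', 'x' and '-' characters."""
    confidence = min(max(confidence, 0.0), 1.0)   # clamp to [0, 1]
    filled = int(confidence * width)              # number of filled cells
    cutoff = int(threshold * width)               # column where the threshold sits
    below = min(filled, cutoff)                   # cells up to the threshold: 'X'
    above = filled - below                        # cells past the threshold: 'x'
    return "X" * below + "x" * above + "-" * (width - filled)
```

With these assumed defaults, any line that contains a lowercase x corresponds to a measurement above the threshold, which is what precise-listen treats as an activation.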

Reducing False Activations

There are two ways of reducing false activations: recording your own false activations and putting them in hey-computer/not-wake-word, or using an automated process to find false positives in large audio files filled with everyday noise.

Method 1

To record your own false activations, launch precise-listen in save mode:

precise-listen hey-computer.net -d hey-computer/not-wake-word

Now you can say words similar to your wake word, and every time the model activates, it will save that recording into the hey-computer/not-wake-word folder. Just make sure never to say the actual wake word while in save mode.

Once you've gathered a few samples of new false activations, retrain your model with precise-train, this time allowing more epochs:

precise-train hey-computer.net hey-computer/ -e 600

You can stop training with Ctrl+C once the accuracy (acc) gets close to 1.0. Now, you can repeat the process, running precise-listen again. You should notice that the model has learned not to activate on the sounds it previously got wrong.

Method 2

While the first method works to a certain degree, you will still notice a large number of false activations on ordinary everyday noise. To reduce the number of times the model activates when it shouldn't, we need a set of long audio files that don't contain the wake word. You can use pretty much any set of sounds, but a more varied set is better. A good place to start is the Public Domain Sounds Backup. You can download it with:

mkdir -p data/random
cd data/random
wget http://downloads.tuxfamily.org/pdsounds/pdsounds_march2009.7z
# Install p7zip
7z x pdsounds_march2009.7z
cd ../..

After downloading a set of sounds, they probably won't be in the right format. They need to be 16-bit signed integer, mono WAV files with a sample rate of 16,000 Hz. Don't worry if that's not the case: all we need is the command line tool ffmpeg and the following script:

SOURCE_DIR=data/random/mp3
DEST_DIR=data/random

for i in "$SOURCE_DIR"/*.mp3; do
    echo "Converting $i..."
    fn=${i##*/}
    ffmpeg -i "$i" -acodec pcm_s16le -ar 16000 -ac 1 -f wav "$DEST_DIR/${fn%.*}.wav"
done

Here, you can see that the script runs ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 output.wav on all the mp3 files, placing the results in data/random. Now we are ready to reduce false activations. Begin the process with the command:

precise-train-incremental hey-computer.net hey-computer/ -r data/random/

Now you will see it run through all the wav files in data/random, picking out clips where the model falsely activated, placing them into the hey-computer/not-wake-word directory, and retraining. This process will take a while, depending on the total length of audio in the dataset and your processor speed.
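Since the running time depends on the total length of audio, it can be useful to know how much audio is sitting in data/random before starting. The following is a rough sketch using only Python's standard library; total_audio_seconds is a hypothetical helper, not a Precise command, and it only looks at .wav files directly inside the given folder.

```python
import wave
from pathlib import Path


def total_audio_seconds(folder):
    """Sum the duration, in seconds, of all readable .wav files in folder."""
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                # duration = number of frames / frames per second
                total += wav.getnframes() / wav.getframerate()
        except (wave.Error, EOFError):
            print(f"Skipping unreadable file: {path}")
    return total
```

For example, total_audio_seconds("data/random") / 3600 gives the amount of audio in hours.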

Once it finishes, we can look at how it performs against the test dataset with:

precise-test hey-computer.net hey-computer/
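When reading accuracy statistics for a wake word model, it helps to separate missed wake words (false negatives) from false activations (false positives). The sketch below shows how these rates follow from raw counts. It does not reproduce the output format of precise-test; the function name summarize and its arguments are chosen here purely for illustration.

```python
def summarize(true_pos, false_neg, false_pos, true_neg):
    """Return accuracy, false-negative rate and false-positive rate from counts."""
    wake = true_pos + false_neg        # clips that contain the wake word
    not_wake = false_pos + true_neg    # clips that do not contain it
    total = wake + not_wake
    return {
        "accuracy": (true_pos + true_neg) / total if total else 0.0,
        "false_negative_rate": false_neg / wake if wake else 0.0,
        "false_positive_rate": false_pos / not_wake if not_wake else 0.0,
    }
```

A model can score a high overall accuracy while still having a false-positive rate that is too high for everyday use, so it is worth looking at both rates separately.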

And, we can test it again through the microphone with:

precise-listen hey-computer.net

Finally, if there are still too many false activations we can add more audio to data/random and repeat the process. If it looks good, continue below.

Converting the Model

So far, we've only dealt with .net files. This extension, used throughout Precise, denotes an HDF5 model file trained with Keras. To reduce runtime dependencies, you must convert the .net Keras model into a .pb TensorFlow model. Do this with the following command:

precise-convert hey-computer.net

That's it! Now the final, exported model consists of the following two files:

  • hey-computer.pb
  • hey-computer.pb.params

The first contains the TensorFlow neural network and the second contains details specific to Precise for how the audio was processed for the network.
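Before copying the exported model elsewhere, you may want to check that both files are present. This is a minimal sketch; exported_files and missing_exports are hypothetical helper names, and it assumes the two files are written next to the .net file with the same base name, as in the hey-computer example above.

```python
from pathlib import Path


def exported_files(net_path):
    """Return the .pb and .pb.params paths expected for a given .net model."""
    pb = Path(net_path).with_suffix(".pb")
    return pb, Path(str(pb) + ".params")


def missing_exports(net_path):
    """List the expected export files that do not exist on disk."""
    return [p for p in exported_files(net_path) if not p.is_file()]
```

For example, missing_exports("hey-computer.net") returns an empty list once both hey-computer.pb and hey-computer.pb.params exist.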