Vad plugin #161

Merged
merged 9 commits on Feb 24, 2019

Conversation

@aaronchantrill (Contributor) commented Jan 29, 2019

Description

VAD Plugin

naomi/application.py

Attached input device parameters (input_samplerate, input_samplewidth, input_channels, input_chunksize) to the input_device object so they are all available to the mic and vad objects as the input_device object is passed around.

For consistency, also moved the output_chunksize and output_padding parameters to the output_device object.

Added initialization of Voice Activity Detection object and passed it to the initialization of the mic object.
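In spirit, the change hangs the audio parameters off the device object itself; a minimal sketch, assuming attribute names that simply mirror the profile keys above (the real code may store them differently):

```python
# Sketch only: device objects now carry their own audio parameters, so the
# mic and vad code can read them from whichever device they are handed.
class InputDevice(object):
    """Stand-in for Naomi's real input device class."""
    pass


input_device = InputDevice()
input_device.input_samplerate = 16000   # samples/sec
input_device.input_samplewidth = 2      # bytes per sample (16-bit)
input_device.input_channels = 1         # mono
input_device.input_chunksize = 1024     # samples per chunk (64 ms at 16 kHz)
```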

naomi/mic.py

Removed the main logic around listen and active listen so I could move it into the VADPlugin class. This should make it easier to implement the "Passive Listen for Commands" project: once the passive listener identifies a keyword in the audio returned, we can pass the same block of audio straight to the active listener for transcription (see the sketch below). This also simplifies a lot of the surrounding code. The original authors ran two threads constantly scanning the audio input for keywords, apparently only to speed up keyword detection. The new VAD method works quite differently, but I'm interested in hearing whether anyone notices a difference.
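A hypothetical sketch of the handoff this enables (vad_plugin, passive_stt, and active_stt are invented stand-ins, not Naomi's API; the point is only that the same buffer is reused):

```python
# Hypothetical sketch of the passive -> active handoff described above.
def wait_for_command(vad_plugin, passive_stt, active_stt, keyword="NAOMI"):
    while True:
        # One VAD-delimited block of raw audio.
        audio = b"".join(vad_plugin.get_audio())
        if keyword in passive_stt.transcribe(audio).upper():
            # Reuse the very same block instead of recording again.
            return active_stt.transcribe(audio)
```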

naomi/plugin.py

Added skeleton for VADPlugin class
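Roughly, the skeleton looks like this (a minimal sketch: _voice_detected is named in the tests below, while get_audio and the constructor signature are my guesses based on the options discussed in this PR):

```python
import logging


class VADPlugin(object):
    """Base class for Voice Activity Detection plugins (sketch)."""

    def __init__(self, input_device, timeout=1, minimum_capture=0.25):
        self._logger = logging.getLogger(__name__)
        self._input_device = input_device
        self._timeout = timeout                  # seconds of silence that end capture
        self._minimum_capture = minimum_capture  # minimum audio to keep, in seconds

    def get_audio(self):
        """Yield chunks of audio from the first detected voice until timeout."""
        raise NotImplementedError()

    def _voice_detected(self, frame):
        """Return True if this frame of raw audio contains voice."""
        raise NotImplementedError()
```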

naomi/pluginstore.py

Added the VADPlugin as a new plugin class

naomi/testutils.py

Added a test audio_device class for my VAD tests

plugins/vad

Added two new plugins, snr_vad and webrtc_vad.

plugins/vad/snr_vad

This is based on the way voice activity detection currently works in Naomi, which is basically just waiting for the audio level to rise above one threshold and later fall back below another.

I have always had trouble with this method, as different sound cards and different microphones register sound quite differently, so choosing a proper threshold level is often problematic.

I am now treating anything over the mean plus one and a half times the standard deviation as audio worth paying attention to. I also reset every 100 samples by cutting all the running counts in half, which keeps the totals from growing to ridiculous numbers over time and lets the noise floor adjust fairly quickly to changes in the environment.
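As a self-contained sketch of that idea (the RMS level measure via the stdlib audioop module, and all names here, are my assumptions, not necessarily what snr_vad does internally):

```python
import audioop


class RunningNoiseFloor(object):
    """Track the mean and standard deviation of frame levels, with decay."""

    def __init__(self, reset_every=100, threshold_stddevs=1.5):
        self._count = 0
        self._sum = 0.0
        self._sum_sq = 0.0
        self._reset_every = reset_every
        self._threshold_stddevs = threshold_stddevs

    def is_voice(self, frame, sample_width=2):
        level = audioop.rms(frame, sample_width)  # level of this chunk
        self._count += 1
        self._sum += level
        self._sum_sq += level * level
        mean = self._sum / self._count
        variance = max(self._sum_sq / self._count - mean * mean, 0.0)
        # Halve the running counts periodically so the totals never grow
        # without bound and the noise floor adapts to the environment.
        if self._count >= self._reset_every:
            self._count //= 2
            self._sum /= 2.0
            self._sum_sq /= 2.0
        # "Voice" is anything over mean + 1.5 standard deviations.
        return level > mean + self._threshold_stddevs * variance ** 0.5
```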

plugins/vad/webrtc_vad

Uses the webrtcvad module, which can be installed via pip. This module requires that every chunk be exactly 10, 20, or 30 ms of audio. The default chunk size for Naomi is 64 ms, so you have to adjust the value of

```yaml
audio:
  input_chunksize:
```

to either 160 (10 ms), 320 (20 ms), or 480 (30 ms) in profile.yml, assuming a rate of 16000 samples/sec.
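For reference, the webrtcvad API itself is tiny; a minimal check of a single chunk, sized to match an input_chunksize of 480, looks like this:

```python
import webrtcvad

vad = webrtcvad.Vad(1)  # aggressiveness 0 (least) to 3 (most)

sample_rate = 16000
frame = b'\x00\x00' * 480  # 480 samples of 16-bit silence = 30 ms

# Expect False for pure silence; real chunks come from the mic.
print(vad.is_speech(frame, sample_rate))
```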

Related Issue

VAD plugin #144
[Feature-Request] - Passive Listening for commands #48
Automate STT training #103

Motivation and Context

This allows us to quickly and easily write and test Voice Activity Detection plugins without having to modify the main structure of Naomi. It also simplifies some of the audio handling, which should make other projects simpler, and all of this should help improve overall speech capture for building catalogs of data samples for training the STT engines.

How Has This Been Tested?

I have tested both plugins on both my x86 Raspbian Stretch VirtualBox machine and my Raspberry Pi 3B+.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project. (In fact, I fixed a bunch of flake8 complaints)
  • My change requires a change to the documentation. (Need to explain the new plugin type)
  • I have updated the documentation accordingly. (Added VAD Plugin documentation #4)
  • I have added tests to cover my changes.
  • All new and existing tests passed. (or at least didn't get worse)

Replaced imp module with importlib module due to deprecation warning.

Simplified the import of configparser so it no longer tries to handle
Python 2 imports.

Re-wrote the parse_plugin_class function to use importlib rather than
imp.

Changed plugin_classes initialization to fix a pep8 complaint:
W504 line break after binary operator
Modified the two VAD plugins so they can receive configuration values passed through from profile.yml.

For snr_vad, the following options can be set:

```yaml
snr_vad:
  timeout: 1
  minimum_capture: 0.25
  threshold: 20
```

For webrtc_vad, the following options can be set:

```yaml
webrtc_vad:
  timeout: 1
  minimum_capture: 0.25
  aggressiveness: 1
```
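Consumed in the plugin, those values would be read along these lines (treating the profile as a plain dict here; Naomi's actual config plumbing may differ):

```python
profile = {
    'webrtc_vad': {
        'timeout': 1,
        'minimum_capture': 0.25,
        'aggressiveness': 1,
    }
}

config = profile.get('webrtc_vad', {})
timeout = config.get('timeout', 1)                     # seconds of silence to stop
minimum_capture = config.get('minimum_capture', 0.25)  # seconds to always keep
aggressiveness = config.get('aggressiveness', 1)       # 0-3, passed to webrtcvad
```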
Oops, I changed the name of the SNR VAD plugin from snr to snr_vad but left the default at snr, so if no vad is selected in the profile, you get an error that plugin "vad" does not exist.

Fixed: set the default to "snr_vad", matching documentation and reality.
@AustinCasteel modified the milestones: 3.0.m1, 3.0.m2 (Feb 5, 2019)
Added a couple of unit tests for the VAD plugins. Also changed
the structure of the __init__ methods to accommodate testing.

The tests work by sending an "empty" sound which should result
in _voice_detected() returning False, and then a small clip
from the naomi/data/audio/naomi.wav file which should result
in _voice_detected() returning True.
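In outline, each plugin's test does something like this (build_vad_plugin_under_test is a hypothetical stand-in for the setup done via naomi/testutils.py):

```python
import unittest
import wave


class TestVADPlugin(unittest.TestCase):
    def setUp(self):
        # Hypothetical helper; the real tests build the plugin with the
        # test audio_device class from naomi/testutils.py.
        self.plugin = build_vad_plugin_under_test()

    def test_silence(self):
        # An "empty" (all-zero) chunk should not register as voice.
        silence = b'\x00\x00' * 480
        self.assertFalse(self.plugin._voice_detected(silence))

    def test_voice(self):
        # A short clip from naomi/data/audio/naomi.wav should register.
        f = wave.open('naomi/data/audio/naomi.wav', 'rb')
        clip = f.readframes(480)
        f.close()
        self.assertTrue(self.plugin._voice_detected(clip))
```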
Removed some unused modules and variables and cleaned up the
formatting a bit.
Added the ability to change the logging level while running
VAD tests. If logging is set to INFO or DEBUG, then this will
print a timeline of where voice data is detected by the plugin.

Also removed extra assert statements that were bugging Codacy.
After completing the VAD plugin testing classes for both plugins,
I realized that almost all the code was duplicated between the
two plugins.

Combined almost all of that code into the Test_VADPlugin class
located in naomi/testutils.py, which should make it easier to maintain
and simplify the development of new plugins.

I had also added some code that makes the test routine output a map
of where audio was and was not detected when the test is run at the
info or debug logging levels (the default logging level for unittests
is warn). Previously this had to be enabled by uncommenting a line in
each test if I wanted to compare the results; with this change,
commenting or uncommenting one line in testutils.py affects the
behavior of both tests.
Codacy complained that the overridden setUp methods used a
different set of parameters.

I have set them all to just def setUp(self): and also added a
callback from the testutils.Test_VADPlugin class.