When running `full_example.py`, the speech recognition itself works fine, but the VAD iterator completely fails to detect voice activity, distinguishing only between "sound" and "silence".
My understanding is that `audio_iterator` should yield a block of audio data if the input contains voice, and `None` otherwise. If so, this doesn't work on my system. As long as there is any sound being recorded by the microphone at all, the iterator yields audio blocks. I have tested this with snapping my fingers, scratching on the desk, even the background noise of a ceiling fan running – they all cause the iterator to produce blocks. Only virtually total silence produces `None`.
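For reference, here is a minimal sketch of how I understand the iterator is meant to be consumed; only `audio_iterator` comes from `full_example.py`, the loop body is mine, so treat this as an illustration rather than the exact script:

```python
# Sketch of the expected consumption pattern (my reading of full_example.py):
# `audio_iterator` is assumed to yield audio blocks while voice is detected
# and None once the VAD decides the current input is not speech.
for block in audio_iterator:
    if block is None:
        # Expected for non-voice input: finger snaps, fan noise, silence.
        print("no voice")
    else:
        # Expected only for actual speech; in practice ANY audible
        # input ends up here, voice or not.
        print(f"voice block of {len(block)} samples")
```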
As a result, the end of a phrase isn't detected unless the room is very, very quiet. I have made multiple test recordings with the same microphone setup and found them to be clean, with no additional noise. Yet as soon as any input rises above a certain threshold, even if it is obviously non-human in origin, it is classified as voice. A modern VAD should be able to do much better.
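A quick way to check whether the decision is really just an amplitude gate would be to log the energy of every block the iterator classifies as voice. This is a rough diagnostic of my own, not part of the example, and it assumes the blocks are NumPy float arrays:

```python
import numpy as np

# Diagnostic sketch (my addition, not in full_example.py):
# if the voice/None boundary tracks a fixed RMS level rather than
# speech content, the "VAD" is effectively an energy threshold.
for block in audio_iterator:
    if block is not None:
        rms = float(np.sqrt(np.mean(np.square(block, dtype=np.float64))))
        print(f"classified as voice, RMS = {rms:.5f}")
```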
Is this actually working for you? What could be the reason for the VAD to fail so completely?