
Quality benchmarks between audiotok / webrtcvad / silero-vad #68

snakers4 opened this issue Jan 21, 2021 · 5 comments

@snakers4

Instruments

We have compared 3 easy-to-use off-the-shelf instruments for voice activity / audio activity detection:

  • audiotok
  • webrtcvad
  • silero-vad

Caveats

  • Full disclaimer - we are mostly interested in voice detection, not just silence detection;
  • In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design); it produces a lot of false positives when detecting speech;
  • audiotok provides Audio Activity Detection, which in layman's terms probably just means detecting silence;
  • silero-vad is geared towards speech detection (as opposed to noise or music);
  • A sensible chunk size for our VAD is at least 75-100ms (pauses in speech shorter than 100ms are not very meaningful, but we prefer 150-250ms chunks; see the quality comparison here), while audiotok and webrtcvad use 30-50ms chunks (we used the default values of 30 ms for webrtcvad and 50 ms for audiotok; see the frame-size sketch after this list);
  • We have excluded pyannote-audio for now (https://github.com/pyannote/pyannote-audio), since it features models pre-trained only on limited academic datasets and is mostly a recipe collection / toolkit for building your own tools rather than a finished tool per se (also, for such a simple task the amount of code bloat is puzzling from a production standpoint; our internal VAD training code is literally just 5 Python modules);
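
To make the frame-size caveat concrete, here is a minimal sketch of feeding 30 ms PCM frames to webrtcvad. The 16 kHz sample rate, the aggressiveness value, and the speech_frames helper are illustrative assumptions, not the exact benchmark settings:

```python
import webrtcvad

SAMPLE_RATE = 16000                               # assumed 16-bit mono PCM at 16 kHz
FRAME_MS = 30                                     # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(1)                            # aggressiveness 0-3 (illustrative value)

def speech_frames(pcm: bytes):
    """Yield (offset_ms, is_speech) for consecutive 30 ms frames."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        offset_ms = i // 2 * 1000 // SAMPLE_RATE
        yield offset_ms, vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
```

A 150-250ms model such as silero-vad would instead consume 5-8 of these frames per decision, which is the granularity difference discussed above.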

Methodology

Please refer to https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology

Quality Benchmarks

Finished tests:

(image: quality comparison chart for the three tools)

Portability and Speed

  • It looks like webrtcvad was originally written in C++ around 2016, so in theory it can be ported to many platforms;
  • I have inquired in the community; the original VAD seems to have matured, and the Python version is based on the 2018 version;
  • It looks like audiotok is written in plain Python, but I guess the algorithm itself could be ported;
  • silero-vad is based on PyTorch and ONNX, so it boasts the same portability options both these frameworks offer (mobile, different ONNX backends, Java and C++ inference APIs, graph conversion from ONNX); see the ONNX loading sketch below;
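
To illustrate the ONNX route, here is a minimal sketch of loading such an export with onnxruntime. The file name is a placeholder, and the exact input signature depends on the export:

```python
import onnxruntime as ort

# Placeholder path: point this at whichever ONNX export of silero-vad you use.
sess = ort.InferenceSession("silero_vad.onnx")

for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)   # exact inputs depend on the export
# Inference would then look like: outputs = sess.run(None, {name: array, ...})
```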

This is by no means extensive or complete research on the topic; please point out anything that is lacking.

@matanox

matanox commented Jan 21, 2021

You've sure done some thorough work here. Just as a sanity check: it looks like the deep neural network model is the only one worth using for real-world applications, does it not? I wonder in what ways the WebRTC VAD model is even useful for the WebRTC project itself...

@snakers4 (Author)

Despite appearances, WebRTC is not so bad; if you just use it to suppress silence, it works just fine.

False positives and the lack of easy tuning / interpretable parameters / docs / support are the main culprits.

Also, for this reason we just used the standard params; we may be wrong somewhere and it could probably be tuned better, but 95% of users will not bother.

@sharvil

sharvil commented Jun 9, 2021

It seems that the Silero VAD and WebRTC VAD make different tradeoffs.

WebRTC produces a VAD decision on 10ms to 30ms frames, whereas Silero produces a VAD decision on 150ms to 250ms frames. While it's true that short silences on the order of 30ms aren't particularly meaningful, the resolution of a VAD decision may be. In some applications, it may not be acceptable to discover a transition between speech and silence up to 125ms late. WebRTC is designed to provide decisions in low-latency streaming applications where having a 100+ms buffer is not acceptable.

I'm happy to see implementations explore different tradeoffs in the design space. Looking at a PR-curve alone, though, doesn't tell the full story.

@snakers4 (Author)

snakers4 commented Jun 9, 2021

whereas Silero produces a VAD decision on 150ms to 250ms frames

While it is true that we cannot really go below 100ms windows (there is just too much noise), you can of course use 100ms with some quality degradation - snakers4/silero-vad#2 (comment).
On the other hand, we design around this limitation by simply applying our VAD in a rolling-window fashion, so you can essentially get 4x - 8x resolution (i.e. 250ms // 4 or 250ms // 8).
The only downside is that you have to use more compute.
We also designed around that by providing models with 1m / 100k / 10k parameters.
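
To illustrate the rolling-window idea, a minimal sketch is below. vad_prob is a hypothetical stand-in for the actual per-chunk model call, and the 250ms window / 4x stride follow the figures above:

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW = int(0.250 * SAMPLE_RATE)   # 250 ms analysis window
STRIDE = WINDOW // 4                # 4x finer decision grid (~62.5 ms)

def vad_prob(chunk: np.ndarray) -> float:
    """Hypothetical stand-in for the model's per-chunk speech probability."""
    raise NotImplementedError

def rolling_vad(audio: np.ndarray, threshold: float = 0.5):
    """Return (offset_seconds, is_speech) decisions on a STRIDE-spaced grid."""
    decisions = []
    for start in range(0, len(audio) - WINDOW + 1, STRIDE):
        p = vad_prob(audio[start:start + WINDOW])
        decisions.append((start / SAMPLE_RATE, p >= threshold))
    return decisions
```

The 4x more model calls per unit of audio are where the extra compute comes from, which is why the smaller 100k / 10k parameter models help.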

@snakers4 (Author)

snakers4 commented Jun 9, 2021

Also, the community has provided some illustrative comparisons: https://github.com/snakers4/silero-vad#live-demonstration
