
⚠️ Public pre-test of Silero-VAD v5 #448

Closed
snakers4 opened this issue Apr 22, 2024 · 6 comments

@snakers4
Owner

Dear members of the community,

Finally, we are nearing the release of the v5 version of the VAD.

Can you please send your audio edge cases in this ticket so that we could stress test the new release of the VAD in advance.

Ideally we need something like this #369 (which we incorporated into validation when choosing the new models), but any systematic cases where the VAD underperforms will be good as well.

Many thanks!

snakers4 added the "help wanted" label on Apr 22, 2024
snakers4 self-assigned this on Apr 22, 2024
snakers4 pinned this issue on Apr 22, 2024
@rizwanishaq

"When is the release scheduled for v5?"

@whaozl

whaozl commented Apr 30, 2024

I find that v4 does not perform well on the single Chinese word 【bye】.

It also does not handle the Cantonese single words 【喺啊】 and 【喺】 well.

@asusdisciple

I do not have any edge cases, but it would be nice if you could change your benchmark methodology. There are a lot of models out there by now. Adopting some new datasets like DIHARD III and comparing against other SOTA models like pyannote would be dope.

@Purfview

Purfview commented May 2, 2024

Systematic cases would be:
  • False positives on near-silence (introduced in v4).
  • Inaccurate segment ends: the trailing edge usually includes up to ~1000 ms of "padding" (introduced in v4).
  • Maybe not systematic, but the start of a segment is often ~100 ms too late.

@cassiotbatista

cassiotbatista commented May 14, 2024

Hi, it's me again 😄

We've done some experiments on what we called "model expectation" w.r.t. the LSTM states' reset frequency.

Recall from the previous issue that my interest is mainly in always-on scenarios, where a VAD listens continuously to whatever is going on in the environment and triggers only when there is speech, which we assume to be a rare event. As such, the model is expected to trigger only a few times (a day, say) relative to the effectively infinite audio stream it keeps receiving over time.

The experiment consists of feeding a long-ish stream of non-speech data to the model and checking how often it hallucinates, i.e., how often it sees speech when there is none. For that, we used the Cafe, Home and Car environments from the QUT-NOISE dataset, which contains 30-50 minute-long noise-only recordings.

In theory, one is presumably advised to reset the model states only after it has seen speech, but we took the liberty of resetting at regular time intervals irrespective of whether speech detection has been triggered.

The following plots show the scikit-learn error rate (1 - accuracy, which goes up to 100% == 1.00), framing the VAD as a frame-wise binary classification problem. The x-axis shows how often the model states are reset. The v3 and v4 models are shown in blue and red, respectively.
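For concreteness, here is a minimal sketch of how such a measurement could be set up with silero-vad's streaming API (the file path, chunk size, reset schedule and the 0.5 threshold are illustrative assumptions, not necessarily our exact setup):

```python
# Minimal sketch: measure how often the VAD "hallucinates" speech on noise-only
# audio as a function of the state-reset interval. File path, chunk size and
# the 0.5 threshold are assumptions for illustration.
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(_, _, read_audio, _, _) = utils

SR = 16000
CHUNK = 512  # samples per frame fed to the model

# A noise-only recording, e.g. one of the QUT-NOISE files (hypothetical path).
wav = read_audio('qut_noise_cafe.wav', sampling_rate=SR)

def error_rate(reset_every_s=None):
    """Fraction of frames classified as speech. Since the audio contains no
    speech at all, every positive frame is an error (i.e. 1 - accuracy)."""
    model.reset_states()
    frames_per_reset = int(reset_every_s * SR / CHUNK) if reset_every_s else None
    errors = total = 0
    for i in range(0, len(wav) - CHUNK + 1, CHUNK):
        if frames_per_reset and total and total % frames_per_reset == 0:
            model.reset_states()  # periodic reset, regardless of detections
        prob = model(wav[i:i + CHUNK], SR).item()
        errors += prob > 0.5      # default decision threshold
        total += 1
    return errors / total

for reset_every_s in (1, 5, 30, 60, None):  # None = never reset
    print(reset_every_s, error_rate(reset_every_s))
```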

[Plots: frame-wise error rate vs. state-reset interval for the QUT Cafe, QUT Home and QUT Car environments; v3 in blue, v4 in red]

I'll write up my conclusions later when I have time; I just wanted to give a heads-up ASAP since it's been a while since this issue was opened.


EDIT: conclusions!

First of all, note that the graphs are not on the same scale, so the models make far fewer mistakes in the car environment (~4% vs. ~20% otherwise), for example.

  • v4 again shows worse results than v3: the red curves are above the blue ones most of the time, indicating that v3 is indeed more resilient to these environmental noises than v4. Remember that the y-axis represents error rate, so lower is better.
  • v4 shows an expectation of speech across all three noisy environments right after the model is reset. This appears as higher error at smaller reset intervals: the more often the model states are reset, the more it hallucinates. The same holds for v3, except in the car environment, where it makes almost no mistakes.
  • All graphs, read from left to right, converge from high to low error rates, which suggests that, for an always-on scenario, never resetting the model states is beneficial. That may sound counter-intuitive, but I find it very hard to argue against these numbers. In addition, v4 even outperforms v3 in the home environment in the long run, i.e., when the model states are never reset.

A possible takeaway is that this whole speech-expectation behaviour reflects the training scheme: the model has probably never (or only rarely) seen non-speech-only data right after the LSTM states were initialized. In other words, if the datasets used to train the VAD are the same ones used to train ASR systems, all of the data contains speech, and that is what the model expects to see at the end of the day.

Any feedback on these results would be welcome @snakers4 😄

@snakers4
Owner Author

A possible takeaway is that this whole speech-expectation behaviour reflects the training scheme: the model has probably never (or only rarely) seen non-speech-only data right after the LSTM states were initialized. In other words, if the datasets used to train the VAD are the same ones used to train ASR systems, all of the data contains speech, and that is what the model expects to see at the end of the day.

We focused on this scenario when training the new VAD, since we had some relevant datasets and had run into our own issues when feeding noise-only / "speechless" audio through the VAD.

The new VAD version was released just now - #2 (comment).

We changed the way it handles context: we now pass part of the previous chunk along with the current chunk. We also made the LSTM component 2x smaller and improved the feature pyramid pooling (we had an improper pooling layer).
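For intuition, here is a rough sketch of the context-carryover idea (a hypothetical helper for illustration only, not the actual v5 internals; the chunk and context sizes are assumptions):

```python
# Illustrative sketch: the tail of the previous chunk is prepended to the
# current chunk, so the model always sees some left context. Hypothetical
# helper, not the actual v5 implementation.
import torch

CHUNK = 512    # current-chunk size (assumption)
CONTEXT = 64   # samples carried over from the previous chunk (assumption)

class ContextualChunker:
    def __init__(self, context_size: int = CONTEXT):
        self.context = torch.zeros(context_size)

    def __call__(self, chunk: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.context, chunk])        # previous tail + current chunk
        self.context = chunk[-len(self.context):]   # remember tail for next call
        return x                                    # what the model would consume
```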

So, in theory and in our practice, the new VAD should handle this edge case better.

Can you please re-run some of your tests, and if the issue persists, open a new issue referencing this one as context.

Many thanks!

snakers4 unpinned this issue on Jun 27, 2024