This vad algorithm does not work well on Chinese data sets #449
Comments
Thanks for your comment!
Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the data set?
Tuning the parameters for your dataset is necessary. If your audio is quiet overall, try a lower threshold and a longer min_silence_samples; otherwise, try a higher threshold and a shorter min_silence_samples.
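To make the tuning advice above concrete, here is a minimal, self-contained sketch of how a threshold and a minimum-silence setting can shape the detected segments. It is not the library's implementation: `probs_to_segments`, `min_silence_frames`, and the probability values are hypothetical and exist only for illustration.

```python
def probs_to_segments(probs, threshold=0.5, min_silence_frames=2):
    """Turn per-frame speech probabilities into (start, end) frame segments.

    `end` is exclusive. A segment is closed only after `min_silence_frames`
    consecutive frames fall below `threshold`, so shorter gaps are bridged.
    Hypothetical helper for illustration, not part of silero-vad.
    """
    segments = []
    start = None   # first frame of the segment being built
    silence = 0    # consecutive sub-threshold frames seen inside it
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # close the segment right after the last speech frame
                segments.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # audio ended while a segment was still open
        segments.append((start, len(probs) - silence))
    return segments


# Made-up probabilities for a quiet recording with soft speech.
probs = [0.1, 0.4, 0.6, 0.7, 0.3, 0.6, 0.2, 0.1, 0.45, 0.5, 0.1, 0.1]

strict = probs_to_segments(probs, threshold=0.5, min_silence_frames=2)
lenient = probs_to_segments(probs, threshold=0.35, min_silence_frames=3)
print(strict)   # [(2, 6), (9, 10)]
print(lenient)  # [(1, 10)]
```

With the lower threshold and longer silence requirement, the soft frames are kept and the two-frame gap is bridged, which is the direction suggested for quiet audio.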
A new VAD version was just released - #2 (comment). It was trained on more than 6,000 languages. Could you please test it on your data again? If the issue persists, please open a new issue referencing this one. Many thanks!
I have tried two Chinese speaker diarization data sets, but the results are not good, especially when the human voice is removed as noise. Can the model be fine-tuned?
The code I used:
import torch
from pprint import pprint

SAMPLING_RATE = 16000  # assumed value; not defined in the original snippet
USE_ONNX = False  # change this to True if you want to test onnx model
if USE_ONNX:
    !pip install -q onnxruntime  # notebook cell syntax

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)
# get speech timestamps from full audio file
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
pprint(speech_timestamps)

# using VADIterator class
vad_iterator = VADIterator(model)
wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

window_size_samples = 1536  # number of samples in a single audio chunk
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break  # the final partial chunk is skipped
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict, end=' ')
vad_iterator.reset_states()  # reset model states after each audio
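Note that the loop above stops at the first short chunk, so the trailing samples of the file are never evaluated. One possible alternative, sketched here with plain Python lists, is to zero-pad the final chunk. The helper `iter_chunks` is hypothetical, and whether padding suits a particular model is an assumption to verify on your own data.

```python
def iter_chunks(samples, window_size, pad_value=0.0):
    """Yield fixed-size chunks, zero-padding the last one instead of dropping it.

    Hypothetical helper for illustration; it works on plain sequences.
    """
    for i in range(0, len(samples), window_size):
        chunk = list(samples[i:i + window_size])
        if len(chunk) < window_size:
            # fill the tail so every chunk has exactly window_size samples
            chunk += [pad_value] * (window_size - len(chunk))
        yield chunk


# 4000 dummy samples with the 1536-sample window used in the snippet above.
chunks = list(iter_chunks([0.1] * 4000, window_size=1536))
print(len(chunks), [len(c) for c in chunks])
```

Here the third chunk holds the remaining 928 real samples followed by padding, so nothing at the end of the recording is silently discarded.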
The result on Alimeeting-Test:
MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403
MS: 31.277793, FA: 2.150170, SER: 1.933873, DER: 35.361836
MS: 31.944428, FA: 0.511342, SER: 2.276318, DER: 34.732088
MS: 47.038586, FA: 0.163343, SER: 9.470302, DER: 56.672231
MS: 74.286394, FA: 0.007934, SER: 3.434961, DER: 77.729289
MS: 30.688677, FA: 0.704153, SER: 2.770183, DER: 34.163013
MS: 59.316559, FA: 0.324209, SER: 8.123554, DER: 67.764322
MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217
MS: 99.417771, FA: 0.000000, SER: 0.058597, DER: 99.476368
MS: 99.910412, FA: 0.000000, SER: 0.000000, DER: 99.910412
MS: 99.493029, FA: 0.000000, SER: 0.120111, DER: 99.613140
MS: 61.856814, FA: 0.623673, SER: 0.184956, DER: 62.665443
MS: 19.090301, FA: 4.226608, SER: 3.039757, DER: 26.356666
MS: 33.685372, FA: 0.338829, SER: 0.267496, DER: 34.291696
MS: 15.374482, FA: 4.018866, SER: 0.518013, DER: 19.911360
MS: 42.467802, FA: 1.968425, SER: 0.268384, DER: 44.704612
MS: 17.370355, FA: 0.626849, SER: 0.326430, DER: 18.323634
MS: 67.082939, FA: 0.626243, SER: 0.180605, DER: 67.889787
MS: 72.216975, FA: 0.557994, SER: 0.130966, DER: 72.905935
MS: 14.936698, FA: 1.236910, SER: 0.225926, DER: 16.399534
The result on Aishell-4:
MS: 79.665430, FA: 0.012366, SER: 5.601830, DER: 85.279626
MS: 67.227370, FA: 0.132288, SER: 1.020209, DER: 68.379866
MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934
MS: 54.602609, FA: 0.152443, SER: 2.483539, DER: 57.238590
MS: 67.082935, FA: 0.078205, SER: 2.599719, DER: 69.760859
MS: 51.416720, FA: 0.204723, SER: 1.379586, DER: 53.001029
MS: 56.959476, FA: 0.203365, SER: 7.326404, DER: 64.489246
MS: 36.057926, FA: 0.157853, SER: 1.157691, DER: 37.373470
MS: 79.330646, FA: 0.097513, SER: 0.407194, DER: 79.835354
MS: 81.295235, FA: 0.062895, SER: 1.192822, DER: 82.550952
MS: 60.887943, FA: 0.599634, SER: 2.776542, DER: 64.264119
MS: 70.418660, FA: 0.084877, SER: 3.336644, DER: 73.840181
MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268
MS: 21.339103, FA: 0.351577, SER: 0.758447, DER: 22.449127
MS: 22.068026, FA: 0.588110, SER: 6.252810, DER: 28.908947
MS: 21.507885, FA: 0.162660, SER: 1.766586, DER: 23.437131
MS: 28.836928, FA: 0.203312, SER: 0.167732, DER: 29.207972
MS: 18.727860, FA: 0.238973, SER: 1.228832, DER: 20.195666
MS: 17.108661, FA: 0.269604, SER: 0.083678, DER: 17.461943
MS: 13.953794, FA: 0.308104, SER: 1.880523, DER: 16.142421
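In the rows reported above, DER equals MS + FA + SER, and most of the error comes from the MS term. The sketch below parses such lines and checks that identity. Reading MS, FA, and SER as missed speech, false alarm, and speaker error is an assumption, since the report does not expand the abbreviations; `parse_line` and `der_is_consistent` are hypothetical helper names.

```python
import re

# Matches lines such as "MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403"
LINE_RE = re.compile(r"MS: ([\d.]+), FA: ([\d.]+), SER: ([\d.]+), DER: ([\d.]+)")


def parse_line(line):
    """Parse one result line into a dict of floats."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized result line: {line!r}")
    ms, fa, ser, der = (float(g) for g in match.groups())
    return {"MS": ms, "FA": fa, "SER": ser, "DER": der}


def der_is_consistent(row, tol=1e-4):
    """Check that DER matches MS + FA + SER within a rounding tolerance."""
    return abs(row["MS"] + row["FA"] + row["SER"] - row["DER"]) <= tol


# Two rows copied from the results above.
rows = [
    parse_line("MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403"),
    parse_line("MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268"),
]
for row in rows:
    share = row["MS"] / row["DER"]  # fraction of the error due to the MS term
    print(der_is_consistent(row), round(share, 2))
```

A high MS share across files is consistent with the VAD discarding speech, which is the symptom described in the report.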