diarization capabilities #406
Comments
Hey @mussaj, thanks for bringing this up. I'm aware of the diarization inaccuracies in Vibe. Currently we don't use sherpa-onnx; instead, we use a Rust library I created called pyannote-rs. That said, it's possible that sherpa-onnx could offer better accuracy. I've also worked on a C++ project that uses sherpa-onnx for diarization, and it might be worth comparing results on the same files to see if it's more accurate. If so, we can look into integrating sherpa-onnx into Vibe. I have ready-to-use binaries available in that project. You can find it here: https://github.com/thewh1teagle/loud.cpp Let me know if you want to give it a try! CC @altunenes
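For anyone who wants to reproduce the comparison locally, here is a rough sketch of what a pyannote-rs-style diarization loop looks like (segmentation model → per-segment embedding → nearest-speaker search). The function, type, and field names (`read_wav`, `segment`, `EmbeddingExtractor`, `EmbeddingManager`, `search_speaker`, `segment.samples`) and the model file names follow my reading of the pyannote-rs README and may not match the current API exactly, so treat this as an assumption-laden sketch rather than Vibe's actual code:

```rust
// Hedged sketch of a pyannote-rs-based diarization loop.
// API names below are assumptions based on the pyannote-rs README,
// not a verbatim copy of Vibe's implementation.
use pyannote_rs::{EmbeddingExtractor, EmbeddingManager};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let max_speakers = 6;
    let similarity_threshold = 0.5;

    // Decode the input file into raw samples (16 kHz mono expected).
    let (samples, sample_rate) = pyannote_rs::read_wav("meeting.wav")?;

    // Run the segmentation model to find speech turns.
    let segments = pyannote_rs::segment(&samples, sample_rate, "segmentation-3.0.onnx")?;

    // Compute one speaker embedding per segment and match it against known speakers.
    let mut extractor = EmbeddingExtractor::new("wespeaker_en_voxceleb_CAM++.onnx")?;
    let mut manager = EmbeddingManager::new(max_speakers);

    for segment in segments {
        let embedding: Vec<f32> = extractor.compute(&segment.samples)?.collect();
        let speaker = manager
            .search_speaker(embedding, similarity_threshold)
            .map(|id| format!("speaker {}", id))
            .unwrap_or_else(|| "unknown".into());
        println!("{:.2}s - {:.2}s: {}", segment.start, segment.end, speaker);
    }
    Ok(())
}
```

Running this and the loud.cpp binaries over the same files should make it easy to compare the two backends segment by segment.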
Hi again! The development of sherpa-rs is exciting! However, as the author mentioned, I think pyannote-rs is still much better (Vibe already uses it, as you mentioned). I still use pyannote for speech detection and speaker ID together with whisper-rs (thanks to Vulkan support). I think the main problem with speaker ID is more related to parallel speech or "unbalanced" audio data (see the discussion). Even if we use bigger speaker ID models, the result is pretty much the same, because the segmentation is not good at dealing with parallel speech. This may be a wrong conclusion, but it's what I came to after my tests. 🙂 I tried some of GStreamer's complex audio normalization pipelines (experimentally), but in some cases they made the results better, while in others they made them worse. So I concluded that we need a better speech detection model to deal with parallel speech and noise, so that feeding the speaker ID model better-segmented audio gives better results. In short, if you're working on a specific audio file, I'm afraid you need to do some work on audio normalization yourself. I still haven't come across a general solution, in the sense that I haven't seen a general answer for what the best audio normalization should be. But Vibe's method works very well for many situations (see Line 54 in cd05a7e, sketched below).
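The referenced line points at Vibe's audio normalization step. My understanding, which is an assumption rather than a quote of that code, is that it boils down to resampling the input to 16 kHz mono 16-bit PCM with ffmpeg before segmentation and speaker ID. A minimal sketch of that kind of normalization, with the exact flags as assumptions, is:

```rust
use std::process::Command;

// Hedged sketch of the normalization step being referenced: resample the
// input to 16 kHz mono 16-bit PCM with ffmpeg before running segmentation
// and speaker ID. The actual flags used at the referenced line may differ.
fn normalize(input: &str, output: &str) -> std::io::Result<()> {
    let status = Command::new("ffmpeg")
        .args(["-i", input, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", output])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "ffmpeg failed to normalize the audio",
        ));
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    normalize("input.mp3", "normalized.wav")
}
```

This kind of loudness- and format-agnostic resampling helps with many files, but it does not solve the overlapping-speech problem described above.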
You may be right. I'm a novice at all of this, but I noticed that when speakers spoke at around the same time or briefly after one another, there were issues detecting the different speakers. It may be related to segmentation/normalization, as you said. Hoping for a fix down the pipeline!
Describe the feature
Hi whiteeagle, I have another request. Is it possible to implement pyannote for speaker diarization instead of sherpa-onnx? I saw mention of this in other contexts, and comparing both, I noticed that diarization using pyannote was much more accurate at separating the speakers and transcribed a bit more effectively. I tried playing with the speaker detection threshold and the transcription model (downloaded the large model, not turbo) and compared the results to a medium model in another program, 'aTrain' on GitHub (which uses pyannote and faster-whisper with the medium model). The difference was very noticeable, and I tested this with almost a dozen files. Please let us know if this is possible. I love Vibe's interface and the fact that it can handle batch transcriptions, but on transcription/diarization quality aTrain seemed to outperform it, so that may be something to look into. I'll continue to use Vibe since it supports multiple files, which is what I directly need (aTrain only does one at a time), and the interface is much nicer, but it may be worth exploring their configuration. Keep up the good work!