diarization capabilities #406

Open
mussaj opened this issue Nov 21, 2024 · 3 comments

@mussaj

mussaj commented Nov 21, 2024

Describe the feature

Hi whiteeagle, I have another request. Would it be possible to use pyannote for speaker diarization instead of Sherpa-ONNX? I have seen this mentioned in other contexts, and when I compared the two, diarization with pyannote separated the speakers much more accurately and the transcription was somewhat better as well.

I experimented with the speaker detection threshold and the transcription model in Vibe (I downloaded the large model, not turbo) and compared the results with another program, 'aTrain' on GitHub, which uses pyannote and faster-whisper with the medium model. The difference was very noticeable, and I tested this with almost a dozen files. Please let us know if this is possible.

I love Vibe's interface and the fact that it can handle batch transcriptions, but aTrain seemed to outperform it on transcription and diarization, so it may be something to look into. I will continue to use Vibe since it supports multiple files, which is what I need (aTrain only does one at a time), and the interface is much nicer, but it may be worth exploring their configuration. Keep up the good work!

@thewh1teagle
Owner

thewh1teagle commented Nov 21, 2024

Hey @mussaj,

Thanks for bringing this up. I’m aware of the diarization inaccuracies in Vibe. Currently, we don't use sherpa-onnx; instead, we use a Rust library I created called pyannote-rs. That said, it's possible that sherpa-onnx could offer better accuracy.

I’ve also worked on a C++ project that uses sherpa-onnx for diarization, and it might be worth comparing the results with the same files to see if it's more accurate. If so, we can look into integrating sherpa-onnx into Vibe.

I have ready-to-use binaries available in that project. You can find it here: https://github.com/thewh1teagle/loud.cpp

Let me know if you want to give it a try!

CC @altunenes,
Do you have any insights on this? I recall that you compared them before.

@altunenes

Hi again! The development of sherpa-rs is exciting! However, as the author mentioned, I think pyannote-rs is still much better (Vibe already uses it, as you mentioned). I still use pyannote for speech detection and speaker ID with whisper-rs (thanks to Vulkan support). I think the main problem with speaker ID is more related to parallel speech or "unbalanced" audio data (see discussion). If we use bigger speaker ID models, the result is pretty much the same, because the segmentation does not handle parallel speech well. This may be a wrong conclusion, but it is what I came to after my tests. 🙂 I tried some of GStreamer's complex audio normalization pipelines (experimentally); in some cases they made the results better, and in other cases they made them worse. So I concluded that we need a better speech detection model to deal with parallel speech and noise, so that the speaker ID model is fed better-segmented audio and gives better results.

In short, if you're working on a specific audio file, I'm afraid you need to do some work on audio normalization. I still haven't come across a general recipe for what the best audio normalization should look like. But Vibe's method works very well in many situations:

pub fn normalize(input: PathBuf, output: PathBuf, additional_ffmpeg_args: Option<Vec<String>>) -> Result<()> {
If you know of a general approach or have read about one somewhere, please share, because in the current situation audio normalization has a significant effect on detection accuracy. 😊
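
For readers who have not seen an ffmpeg-based normalization step, here is a minimal sketch of the general idea. It is not Vibe's actual implementation: only the shape of the `normalize` signature comes from the line quoted above, while the helper `build_ffmpeg_args`, the 16 kHz mono 16-bit PCM target, and the use of the `loudnorm` filter as an extra argument are assumptions made for illustration. It also assumes `ffmpeg` is available on `PATH`.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: build the ffmpeg argument list that converts `input`
/// to 16 kHz, mono, 16-bit PCM audio written to `output`.
/// `extra_args` are inserted just before the output path, which is where
/// ffmpeg expects output options such as an audio filter.
fn build_ffmpeg_args(input: &Path, output: &Path, extra_args: &[String]) -> Vec<String> {
    let mut args: Vec<String> = vec![
        "-y".to_string(), // overwrite the output file if it exists
        "-i".to_string(),
        input.to_string_lossy().into_owned(),
        "-ar".to_string(), // output sample rate
        "16000".to_string(),
        "-ac".to_string(), // number of output channels
        "1".to_string(),
        "-c:a".to_string(), // audio codec: signed 16-bit little-endian PCM
        "pcm_s16le".to_string(),
    ];
    args.extend(extra_args.iter().cloned());
    args.push(output.to_string_lossy().into_owned());
    args
}

/// Sketch of a normalization step: run ffmpeg with the arguments above and
/// turn a non-zero exit status into an error.
fn normalize(input: &Path, output: &Path, extra_args: &[String]) -> io::Result<()> {
    let status = Command::new("ffmpeg")
        .args(build_ffmpeg_args(input, output, extra_args))
        .status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("ffmpeg exited with {status}"),
        ))
    }
}
```

Calling `normalize(Path::new("in.mp3"), Path::new("out.wav"), &["-af".to_string(), "loudnorm".to_string()])` would add loudness normalization on top of the resampling. Keeping the filter in `extra_args` makes it easy to try different normalization chains per file, which matters here since no single setting seems to work for every recording.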

@mussaj
Author

mussaj commented Nov 21, 2024

You may be right. I am a novice at all of this, but I noticed that when the speakers spoke at around the same time or briefly after one another, there were issues detecting the different speakers. It may be related to segmentation/normalization, as you said. Hoping for a fix down the pipeline!

3 participants