diarization capabilities #406
Comments
Hey @mussaj, thanks for bringing this up. I'm aware of the diarization inaccuracies in Vibe. Currently we don't use sherpa-onnx; instead, we use a Rust library I created called pyannote-rs. That said, it's possible that sherpa-onnx could offer better accuracy. I've also worked on a C++ project that uses sherpa-onnx for diarization, and it might be worth comparing results on the same files to see if it's more accurate. If so, we can look into integrating sherpa-onnx into Vibe. I have ready-to-use binaries available in that project. You can find it here: https://github.com/thewh1teagle/loud.cpp Let me know if you want to give it a try! CC @altunenes
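For anyone who wants to reproduce the comparison locally, here is a rough sketch of what a pyannote-rs-style diarization loop looks like (segmentation model → per-segment embedding → nearest-speaker search). The function, type, and field names (`read_wav`, `segment`, `EmbeddingExtractor`, `EmbeddingManager`, `search_speaker`, `segment.samples`) and the model file names follow my reading of the pyannote-rs README and may not match the current API exactly, so treat this as an assumption-laden sketch rather than Vibe's actual code:

```rust
// Hedged sketch of a pyannote-rs-based diarization loop.
// API names below are assumptions based on the pyannote-rs README,
// not a verbatim copy of Vibe's implementation.
use pyannote_rs::{EmbeddingExtractor, EmbeddingManager};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let max_speakers = 6;
    let similarity_threshold = 0.5;

    // Decode the input file into raw samples (16 kHz mono expected).
    let (samples, sample_rate) = pyannote_rs::read_wav("meeting.wav")?;

    // Run the segmentation model to find speech turns.
    let segments = pyannote_rs::segment(&samples, sample_rate, "segmentation-3.0.onnx")?;

    // Compute one speaker embedding per segment and match it against known speakers.
    let mut extractor = EmbeddingExtractor::new("wespeaker_en_voxceleb_CAM++.onnx")?;
    let mut manager = EmbeddingManager::new(max_speakers);

    for segment in segments {
        let embedding: Vec<f32> = extractor.compute(&segment.samples)?.collect();
        let speaker = manager
            .search_speaker(embedding, similarity_threshold)
            .map(|id| format!("speaker {}", id))
            .unwrap_or_else(|| "unknown".into());
        println!("{:.2}s - {:.2}s: {}", segment.start, segment.end, speaker);
    }
    Ok(())
}
```

Running this and the loud.cpp binaries over the same files should make it easy to compare the two backends segment by segment.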
Hi again! The development of sherpa-rs is exciting! However, as the author mentioned, I think pyannote-rs is still much better (Vibe already uses it, as you mentioned). I still use pyannote for speech detection and speaker ID together with whisper-rs (thanks to Vulkan support). I think the main problem with speaker ID is more related to parallel speech or "unbalanced" audio data (see the discussion). Even if we use bigger speaker ID models, the result is pretty much the same, because the segmentation is not good at dealing with parallel speech. This may be a wrong conclusion, but it's what I came to after my tests. 🙂 I tried some of GStreamer's complex audio normalization pipelines (experimentally), but in some cases they made the results better, while in others they made them worse. So I concluded that we need a better speech detection model to deal with parallel speech and noise, so that feeding the speaker ID model better-segmented audio gives better results. In short, if you're working on a specific audio file, I'm afraid you need to do some work on audio normalization yourself. I still haven't come across a general solution, in the sense that I haven't seen a general answer for what the best audio normalization should be. But Vibe's method works very well for many situations (see Line 54 in cd05a7e, sketched below).
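The referenced line points at Vibe's audio normalization step. My understanding, which is an assumption rather than a quote of that code, is that it boils down to resampling the input to 16 kHz mono 16-bit PCM with ffmpeg before segmentation and speaker ID. A minimal sketch of that kind of normalization, with the exact flags as assumptions, is:

```rust
use std::process::Command;

// Hedged sketch of the normalization step being referenced: resample the
// input to 16 kHz mono 16-bit PCM with ffmpeg before running segmentation
// and speaker ID. The actual flags used at the referenced line may differ.
fn normalize(input: &str, output: &str) -> std::io::Result<()> {
    let status = Command::new("ffmpeg")
        .args(["-i", input, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", output])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "ffmpeg failed to normalize the audio",
        ));
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    normalize("input.mp3", "normalized.wav")
}
```

This kind of loudness- and format-agnostic resampling helps with many files, but it does not solve the overlapping-speech problem described above.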
You may be right. I'm a novice at all of this, but I noticed that when speakers spoke at around the same time or briefly after one another, there were issues detecting the different speakers. It may be related to segmentation/normalization, as you said. Hoping for a fix down the pipeline!
Describe the feature
Hi whiteeagle, I have another request. Is it possible to implement pyannote for speaker diarization instead of sherpa-onnx? I saw mention of this in other contexts, and comparing both, I noticed that diarization using pyannote was much more accurate at separating the speakers and transcribed a bit more effectively. I tried playing with the speaker detection threshold and the transcription model (downloaded the large model, not turbo) and compared the results to a medium model in another program, 'aTrain' on GitHub (which uses pyannote and faster-whisper with the medium model). The difference was very noticeable, and I tested this with almost a dozen files. Please let us know if this is possible. I love Vibe's interface and the fact that it can handle batch transcriptions, but on transcription/diarization quality aTrain seemed to outperform it, so that may be something to look into. I'll continue to use Vibe since it supports multiple files, which is what I directly need (aTrain only does one at a time), and the interface is much nicer, but it may be worth exploring their configuration. Keep up the good work!