$300 - improve local transcription accuracy #431
💎 $300 bounty • Screenpi.pe
Thank you for contributing to mediar-ai/screenpipe!
I've heard that a well-implemented form of the new Whisper Turbo is very well optimized and significantly faster than real time, definitely so on my hardware (an M1 Mac).
#413 yes, but it's quality i'm thinking of here more than speed
Certainly, we should prioritize quality. I noticed a clear difference in output quality when running the same Whisper model on a full recorded file versus the current implementation, which transcribes small chunks.
/attempt #431
I noted that the existing whisper-large model only works with English. Could we support different languages or different Whisper models as well?
any update on this? i want to add diarization using https://github.com/thewh1teagle/pyannote-rs but it might overlap with this issue
Diarization is also something I want from screenpipe, as well as speaker verification. I meant to cancel my attempt on this since I haven't had the time to work on it, but Algora's cancel button doesn't work.
Oh okay, I'll have a look at this issue + diarization etc. then.
@louis030195 Are you actively working on this? I will have some time to get started tonight.
@EzraEllette a bit. i did a simple unit test to measure accuracy on short wav files, but it's not really fully reflecting real screenpipe usage: https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/tests/accuracy_test.rs

i improved accuracy a bit using normalization of the audio, from 57% to 62% (while deepgram is 82%), using whisper turbo.

been thinking of switching from candle to whisper-cpp (bindings) and adding diarization etc. by just copy-pasting the code here: https://github.com/thewh1teagle/vibe/blob/main/core/src/transcribe.rs

something else we'd like to do in the future is to be able to stream transcriptions, through a websocket in the server for example, for different use cases. that might affect the architecture, but i think the main priority rn is to improve quality.

i won't have much time left today, and tomorrow morning i've got a bunch of calls, so feel free to try things
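For reference, a rough sketch of the kind of word-level measurement that test does: normalize, compute Levenshtein distance over words, report accuracy. The names and normalization details are illustrative, not the actual code from `accuracy_test.rs`:

```rust
// Word-level accuracy via Levenshtein distance; a hypothetical harness,
// not the actual logic in screenpipe-audio/tests/accuracy_test.rs.

fn normalize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_string())
        .filter(|w| !w.is_empty())
        .collect()
}

fn levenshtein(a: &[String], b: &[String]) -> usize {
    // classic two-row dynamic programming over words
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, wa) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, wb) in b.iter().enumerate() {
            let cost = if wa == wb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

fn word_accuracy(transcript: &str, reference: &str) -> f64 {
    let (hyp, refr) = (normalize(transcript), normalize(reference));
    if refr.is_empty() {
        return 0.0;
    }
    1.0 - levenshtein(&hyp, &refr) as f64 / refr.len() as f64
}

fn main() {
    let acc = word_accuracy(
        "hello world this is screen pipe",
        "hello world, this is screenpipe",
    );
    println!("accuracy: {:.0}%", acc * 100.0); // 60% on this toy pair
}
```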
I have been meaning to make a pipe, but I need realtime data. I'll pull what you have and see what I can do. Also, I'm going to get a baseline on whisper-cpp's accuracy and go from there.
@louis030195 I'm seeing a ~5% improvement with spectral subtraction, using the last 100ms frame (whose speech/silence status is unknown) as the noise estimate. I might try using the last few hundred ms rather than just one frame, but for now, here it is:
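For context on the technique, a minimal spectral subtraction sketch, assuming the `rustfft` crate; the noise spectrum is estimated from a trailing frame presumed to be non-speech. Function and parameter names are illustrative, not the PR's code:

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// Subtract an estimated noise magnitude spectrum from `frame`, in place.
/// `noise` is a frame presumed to contain only background noise
/// (e.g. the trailing 100ms of the previous chunk).
fn spectral_subtract(frame: &mut [f32], noise: &[f32]) {
    let n = frame.len();
    assert_eq!(n, noise.len(), "frame and noise estimate must match in length");

    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(n);
    let ifft = planner.plan_fft_inverse(n);

    let mut f: Vec<Complex<f32>> = frame.iter().map(|&x| Complex::new(x, 0.0)).collect();
    let mut w: Vec<Complex<f32>> = noise.iter().map(|&x| Complex::new(x, 0.0)).collect();
    fft.process(&mut f);
    fft.process(&mut w);

    for (bin, noise_bin) in f.iter_mut().zip(&w) {
        let mag = (bin.norm() - noise_bin.norm()).max(0.0); // floor at zero, no over-subtraction
        let phase = bin.arg();                              // keep the original phase
        *bin = Complex::from_polar(mag, phase);
    }

    ifft.process(&mut f);
    for (out, c) in frame.iter_mut().zip(&f) {
        *out = c.re / n as f32; // rustfft does not normalize, so divide by n
    }
}
```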
I'm implementing dynamic range compression to see if that helps. |
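For context, dynamic range compression in its simplest form looks something like this; the threshold and ratio values are illustrative, not the ones tried here:

```rust
/// Bare-bones feed-forward compressor over f32 samples in [-1, 1]:
/// everything above `threshold` is scaled down by `ratio`.
fn compress(samples: &mut [f32], threshold: f32, ratio: f32) {
    for s in samples.iter_mut() {
        let level = s.abs();
        if level > threshold {
            let compressed = threshold + (level - threshold) / ratio;
            *s = s.signum() * compressed;
        }
    }
}

// e.g. compress(&mut chunk, 0.5, 4.0);
```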
not seeing a difference |
Strange. Usually I get very good results with whisper-large, but that's working with long files (15min+).
There are errors in the middle of the transcripts so I am focusing on those through audio preprocessing. |
I should mention that I changed the sinc interpolation to cubic, which is drastically slower than linear. I updated my PR to reflect that. I'm trying some other sampling changes, but I'm doubtful they will improve anything.
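For readers weighing the linear-vs-cubic trade-off: cubic interpolation fits a third-order polynomial through four neighbouring samples per output sample, versus two samples for linear, hence the slowdown. A hand-rolled Catmull-Rom sketch of the idea, illustrative only, not screenpipe's resampler:

```rust
/// Resample `input` by `ratio` (output_rate / input_rate) using
/// Catmull-Rom cubic interpolation over four neighbouring samples.
fn resample_cubic(input: &[f32], ratio: f64) -> Vec<f32> {
    if input.is_empty() {
        return Vec::new();
    }
    let out_len = (input.len() as f64 * ratio) as usize;
    // clamp indexing so the edges reuse the boundary samples
    let get = |i: isize| -> f32 { input[i.clamp(0, input.len() as isize - 1) as usize] };
    (0..out_len)
        .map(|n| {
            let pos = n as f64 / ratio;
            let i = pos.floor() as isize;
            let t = (pos - pos.floor()) as f32;
            let (p0, p1, p2, p3) = (get(i - 1), get(i), get(i + 1), get(i + 2));
            // Catmull-Rom spline through the four neighbouring samples
            0.5 * ((2.0 * p1)
                + (-p0 + p2) * t
                + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
                + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t * t * t)
        })
        .collect()
}
```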
It's worth mentioning that the Levenshtein distance would be lower if we sanitized the transcription output to remove the hallucinations and timestamps. I think we can assume that if a transcript has two segments with the same timestamp, the shorter segment should be removed. Other than that I'm not sure what you want to do with the timestamps.
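The duplicate-timestamp rule could look something like this; `Segment` is a hypothetical struct for illustration, not a type from screenpipe:

```rust
use std::collections::HashMap;

struct Segment {
    start_ms: u64,
    text: String,
}

/// When two segments share a start timestamp, keep only the longer one,
/// then restore chronological order.
fn dedupe_segments(segments: Vec<Segment>) -> Vec<Segment> {
    let mut best: HashMap<u64, Segment> = HashMap::new();
    for seg in segments {
        let replace = match best.get(&seg.start_ms) {
            Some(kept) => seg.text.len() > kept.text.len(), // longer wins
            None => true,
        };
        if replace {
            best.insert(seg.start_ms, seg);
        }
    }
    let mut out: Vec<Segment> = best.into_values().collect();
    out.sort_by_key(|s| s.start_ms);
    out
}
```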
i think one of the common issues with screenpipe is when someone speaks, then stops, then starts again within a 30s chunk: whisper will create "Thank you" in the silences. that's one thing we should solve somehow, through some audio processing hacks i guess.

regarding current accuracy metrics, i think we could have a second unit test that contains audio recordings from screenpipe, like 4 of them, and either write the expected transcripts manually or use some online transcription service to create them (which makes mistakes). for the current unit test, some of the expected transcripts were done with deepgram, which makes a few mistakes too. honestly, even as a human i struggle to transcribe some of the audio recordings sometimes, when people have weird accents.

also, something else we could eventually do is fix transcripts with an LLM in real time, but i'd expect it to be a hard task to do well: it shouldn't take more than 1gb of memory, shouldn't add hallucinations, shouldn't overload the GPU/CPU, etc.

another reason i wanted to switch to whisper-cpp is that it has more features, like an initial prompt, which we could put in a screenpipe cli arg and in the app ui settings, like "yo my name is louis sometimes i talk about screenpipe, i have french accent so make sure to take this into account ...". candle is really barebones; we have to reimplement everything ourselves, sadly, and we don't have time to turn into AI researchers at this point.

i guess diarization would improve accuracy a little bit too, by running transcription only on frames that belong to a specific voice.

some rough thoughts. what do you think the next steps are, @EzraEllette ?
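One cheap hack for the silence hallucination problem is to gate transcription on chunk energy before handing audio to whisper. A minimal RMS-gate sketch; the threshold is a guess to tune, and a real VAD would do better:

```rust
/// Return true if a chunk of f32 samples is probably silence, based on
/// RMS energy. The -50 dBFS threshold is a guess, not a screenpipe constant.
fn is_probably_silence(samples: &[f32]) -> bool {
    if samples.is_empty() {
        return true;
    }
    let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
    let dbfs = 20.0 * rms.max(1e-10).log10();
    dbfs < -50.0
}

// before transcribing a 30s chunk:
// if is_probably_silence(&chunk) { continue; } // skip, avoid "Thank you" hallucinations
```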
@louis030195 I have a couple meetings tonight but I'll give you some information afterwards. Right now it makes more sense to use tools that have more features and are actively maintained by other developers when possible. I'll contact you once my meetings are finished. |
do you want to refactor to always record audio + send chunks for transcription? also interested in whether there could be a way to stream audio + transcription through the API, for extension reasons. also, the #170 use case is important
Yes. I want to make that refactor and explore streaming.
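A rough shape for that refactor, under the assumption of tokio channels: one task always recording, chunks flowing to a transcriber, transcripts fanned out to any number of subscribers (a later websocket endpoint would just be one more subscriber). All function names here are placeholders:

```rust
use tokio::sync::{broadcast, mpsc};

async fn pipeline() {
    let (audio_tx, mut audio_rx) = mpsc::channel::<Vec<f32>>(16);
    let (text_tx, _) = broadcast::channel::<String>(64);

    // recorder task: never stops capturing, just hands chunks off
    tokio::spawn(async move {
        loop {
            let chunk = record_chunk().await; // e.g. 30s of samples
            if audio_tx.send(chunk).await.is_err() {
                break; // transcriber gone, shut down
            }
        }
    });

    // transcriber task: pulls chunks, pushes transcripts to all subscribers
    let tx = text_tx.clone();
    tokio::spawn(async move {
        while let Some(chunk) = audio_rx.recv().await {
            let text = transcribe(&chunk).await;
            let _ = tx.send(text); // ok if nobody is listening yet
        }
    });

    // a websocket handler (or the DB writer) would just do:
    // let mut rx = text_tx.subscribe();
    // while let Ok(text) = rx.recv().await { /* forward to client */ }
}

// placeholders standing in for real capture/transcription functions
async fn record_chunk() -> Vec<f32> { unimplemented!("placeholder") }
async fn transcribe(_chunk: &[f32]) -> String { unimplemented!("placeholder") }
```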
adding some context: some users had issues with language, e.g. #451, but i think #469 would solve it? diarization (https://github.com/thewh1teagle/pyannote-rs) can probably slightly increase accuracy too. other issues with audio:
on my side i want to prioritize having high-quality data infrastructure for audio that ideally works across OSes (macOS, Windows at least); UI things are less of a priority
Speaker identification and diarization will be a large undertaking. Chunking the audio with overlap is working for now (a sketch of that idea follows below). Here are some of my thoughts about streaming audio data:
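For concreteness, the overlapped chunking mentioned above could look like the following sketch; the window and overlap sizes are examples, not screenpipe's actual values:

```rust
/// Split samples into fixed-size chunks that overlap, so words cut at a
/// chunk boundary appear whole in the next chunk.
fn overlapping_chunks(samples: &[f32], chunk_len: usize, overlap: usize) -> Vec<&[f32]> {
    assert!(overlap < chunk_len);
    let step = chunk_len - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < samples.len() {
        let end = (start + chunk_len).min(samples.len());
        chunks.push(&samples[start..end]);
        if end == samples.len() {
            break;
        }
        start += step;
    }
    chunks
}

// e.g. at 16 kHz: overlapping_chunks(&audio, 30 * 16_000, 2 * 16_000)
```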
agree, let's not do speaker identification and diarization for now. agree with streaming
how we record & transcribe now:
definition of done:
possible ways to increase accuracy:
make sure to measure first, then optimise second, not the other way around. no "it looks better after my change"; i only trust numbers. thank you
/bounty 300