
$300 - improve local transcription accuracy #431

Open
louis030195 opened this issue Oct 4, 2024 · 28 comments

@louis030195
Collaborator

louis030195 commented Oct 4, 2024

how we record & transcribe now:

  1. record a 30s chunk of audio on each device
  2. run a local voice activity detection (VAD) model to extract speech frames; if there aren't enough, skip transcription
  3. transcribe the speech frames
  4. encode the audio to mp4
  5. save the transcription + mp4 source to the db

definition of done:

  • audio transcription accuracy is measured using best practices & a benchmark, e.g. we can say "audio transcription is 78% accurate" by running a benchmark command (usually you have an mp4 with some voice and know the transcript beforehand, then you check how the model performs compared to the expected output, do some math, and it gives you a %); see the sketch after this list
  • audio transcription accuracy is higher than now (to put a number on it, say ~20% better)
  • the rest of the program is unchanged or only minimally impacted in terms of stability, resource usage, etc., and still works on macOS and Windows, and preferably Linux
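
here's a rough sketch of what that benchmark could compute (not final code); accuracy here is 1 minus the word error rate, using a word-level Levenshtein distance against a reference transcript:

```rust
// rough sketch of a word-accuracy benchmark (not final code).
// in the real benchmark, `actual` would come from running the full
// pipeline (VAD + whisper) on an audio file with a known transcript.

/// word-level Levenshtein distance between reference and hypothesis.
fn word_edit_distance(reference: &str, hypothesis: &str) -> usize {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    let mut dp = vec![vec![0usize; h.len() + 1]; r.len() + 1];
    for i in 0..=r.len() { dp[i][0] = i; }
    for j in 0..=h.len() { dp[0][j] = j; }
    for i in 1..=r.len() {
        for j in 1..=h.len() {
            let cost = if r[i - 1].eq_ignore_ascii_case(h[j - 1]) { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1)
                .min(dp[i][j - 1] + 1)
                .min(dp[i - 1][j - 1] + cost);
        }
    }
    dp[r.len()][h.len()]
}

/// accuracy = (1 - word error rate), clamped to 0, as a percentage.
fn accuracy_percent(reference: &str, hypothesis: &str) -> f64 {
    let words = reference.split_whitespace().count().max(1);
    let wer = word_edit_distance(reference, hypothesis) as f64 / words as f64;
    (1.0 - wer).max(0.0) * 100.0
}

fn main() {
    let expected = "hello and welcome to the meeting";
    let actual = "hello welcome to the meeting";
    println!("accuracy: {:.1}%", accuracy_percent(expected, actual));
}
```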

possible ways to increase accuracy:

  • overlap audio chunks (see the sketch after this list)
  • change/tweak duration
  • tweak VAD settings
  • tweak sample rate things
  • change/add new model
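
for the overlap idea, something like this could work (just a sketch, the 2s figure is only an example, not a measured value):

```rust
/// Keeps the tail of each audio chunk so the next chunk is transcribed
/// with some context, reducing words cut in half at chunk boundaries.
struct OverlappingChunker {
    sample_rate: usize,
    overlap_secs: f32,
    tail: Vec<f32>,
}

impl OverlappingChunker {
    fn new(sample_rate: usize, overlap_secs: f32) -> Self {
        Self { sample_rate, overlap_secs, tail: Vec::new() }
    }

    /// Returns the chunk to transcribe: previous tail + new samples.
    fn next_chunk(&mut self, new_samples: &[f32]) -> Vec<f32> {
        let mut chunk = Vec::with_capacity(self.tail.len() + new_samples.len());
        chunk.extend_from_slice(&self.tail);
        chunk.extend_from_slice(new_samples);

        // Remember the last `overlap_secs` of audio for the next call.
        let overlap_len = (self.sample_rate as f32 * self.overlap_secs) as usize;
        let start = new_samples.len().saturating_sub(overlap_len);
        self.tail = new_samples[start..].to_vec();
        chunk
    }
}
```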

make sure to measure first, then optimise second, not the other way around, no "it looks better after my change", i only trust numbers, thank you

/bounty 300

@louis030195 louis030195 added the bug Something isn't working label Oct 4, 2024

linear bot commented Oct 4, 2024


algora-pbc bot commented Oct 4, 2024

💎 $300 bounty • Screenpi.pe

Steps to solve:

  1. Start working: Comment /attempt #431 with your implementation plan
  2. Submit work: Create a pull request including /claim #431 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to mediar-ai/screenpipe!


| Attempt | Started (GMT+0) | Solution |
| --- | --- | --- |
| 🔴 @EzraEllette | Oct 7, 2024, 4:49:13 AM | WIP |

@louis030195 louis030195 removed the bug Something isn't working label Oct 4, 2024
@TanGentleman
Contributor

I've heard that a well-implemented version of the new Whisper Turbo is very well optimized and significantly faster than realtime, definitely so on my hardware (an M1 Mac).
openai/whisper#2363
https://www.markhneedham.com/blog/2024/10/02/insanely-fast-whisper-running-openai-whisper-turbo-mac/

@louis030195
Collaborator Author

I've heard that a well implemented form of the new Whisper Turbo is very well optimized and significantly faster than realtime, definitely so on my hardware (an M1 Mac). openai/whisper#2363 https://www.markhneedham.com/blog/2024/10/02/insanely-fast-whisper-running-openai-whisper-turbo-mac/

#413 yes, but it's quality i'm thinking of here more than speed

@NicodemPL

Certainly, we should prioritize quality. With the same Whisper model, I noticed a notable difference in output quality between processing a full recorded file and the current implementation, which works on small chunks.
I've done some digging, and it seems that a 2-second overlap plus a check for repeated words before finalizing the transcript should do the job.
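
Something along these lines, just as a sketch (the function name is made up):

```rust
/// Drops words from the start of `next` that duplicate the end of `prev`,
/// which happens when overlapping chunks transcribe the same speech twice.
fn merge_overlapping_transcripts(prev: &str, next: &str) -> String {
    let prev_words: Vec<&str> = prev.split_whitespace().collect();
    let next_words: Vec<&str> = next.split_whitespace().collect();
    let max_overlap = prev_words.len().min(next_words.len());

    // Find the longest suffix of `prev` that equals a prefix of `next`.
    let mut overlap = 0;
    for n in (1..=max_overlap).rev() {
        let suffix = &prev_words[prev_words.len() - n..];
        let prefix = &next_words[..n];
        if suffix.iter().zip(prefix.iter()).all(|(a, b)| a.eq_ignore_ascii_case(b)) {
            overlap = n;
            break;
        }
    }
    format!("{} {}", prev, next_words[overlap..].join(" ")).trim().to_string()
}
```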

@EzraEllette
Contributor

EzraEllette commented Oct 7, 2024

/attempt #431


@mlasy

mlasy commented Oct 8, 2024

i noticed that the existing whisper-large model only works with English. Could we support different languages or different whisper models as well?

@louis030195
Collaborator Author


any update on this?

i want to add diarization using https://github.com/thewh1teagle/pyannote-rs but it might overlap with this issue

@EzraEllette
Contributor

Diarization is also something I want from screenpipe, as well as speaker verification. I meant to cancel my attempt on this since I haven't had the time to work on it, but algora's cancel button doesn't work.

@louis030195
Collaborator Author

Oh okay, I'll have a look at this issue + diarization, etc then

@EzraEllette
Contributor

@louis030195 Are you actively working on this? I will have some time to get started tonight.

@louis030195
Collaborator Author

@EzraEllette a bit, i did a simple unit test to measure accuracy on short wav files but it doesn't fully reflect real screenpipe usage

https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/tests/accuracy_test.rs

i improved accuracy a bit using audio normalization, from 57% to 62% (while deepgram is at 82%), using whisper turbo
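
for reference, the kind of normalization i mean is roughly this (a sketch of peak normalization, not necessarily the exact variant used):

```rust
/// Peak-normalize a mono f32 buffer so the loudest sample hits `target_peak`
/// (e.g. 0.95). Quiet speech gets boosted before it reaches whisper.
fn normalize_peak(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak > f32::EPSILON {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}
```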

been thinking of switching from candle to whisper-cpp (bindings) and adding diarization etc. by just copy-pasting the code here https://github.com/thewh1teagle/vibe/blob/main/core/src/transcribe.rs

something else we'd like to do in the future is to stream transcriptions through a websocket in the server, for example, for different use cases; this might affect the architecture, but i think the main priority rn is to improve quality
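
for the websocket thing, the server side could look roughly like this if we expose a /transcriptions/ws endpoint on the axum server (endpoint name and channel wiring are assumptions, just a sketch):

```rust
use axum::{
    extract::ws::{Message, WebSocket, WebSocketUpgrade},
    response::IntoResponse,
    routing::get,
    Router,
};
use tokio::sync::broadcast;

/// Upgrade the HTTP request and forward every finished transcription to the client.
async fn ws_transcriptions(
    ws: WebSocketUpgrade,
    tx: broadcast::Sender<String>,
) -> impl IntoResponse {
    let rx = tx.subscribe();
    ws.on_upgrade(move |socket| stream_transcriptions(socket, rx))
}

async fn stream_transcriptions(mut socket: WebSocket, mut rx: broadcast::Receiver<String>) {
    // Each transcribed chunk published on the broadcast channel is pushed
    // to the connected client as a text frame.
    while let Ok(text) = rx.recv().await {
        if socket.send(Message::Text(text.into())).await.is_err() {
            break; // client disconnected
        }
    }
}

fn router(tx: broadcast::Sender<String>) -> Router {
    Router::new().route(
        "/transcriptions/ws",
        get(move |ws: WebSocketUpgrade| ws_transcriptions(ws, tx.clone())),
    )
}
```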

i won't have much time left today, and tomorrow morning i've got a bunch of calls, so feel free to try things

@EzraEllette
Contributor

I have been meaning to make a pipe, but I need realtime data.

I'll pull what you have and see what I can do. Also, I'm going to get a baseline on whisper-cpp's accuracy and go from there.

@EzraEllette
Contributor

@louis030195 I'm seeing ~5% improvement with spectral subtraction using the last 100ms frame of unknown status. I might try using the last few hundred ms rather than just one frame, but for now, here it is:
[image]
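
Roughly what I mean by spectral subtraction, as a simplified sketch using the rustfft crate (not the exact code from my branch):

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// Subtract an estimated noise magnitude spectrum from one audio frame.
/// `noise_frame` is a frame assumed to contain no speech (e.g. the trailing
/// ~100ms of the previous chunk); both frames must have the same length.
fn spectral_subtract(frame: &[f32], noise_frame: &[f32]) -> Vec<f32> {
    assert_eq!(frame.len(), noise_frame.len());
    let n = frame.len();
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(n);
    let ifft = planner.plan_fft_inverse(n);

    let to_spectrum = |signal: &[f32]| {
        let mut buf: Vec<Complex<f32>> =
            signal.iter().map(|&s| Complex::new(s, 0.0)).collect();
        fft.process(&mut buf);
        buf
    };

    let mut spec = to_spectrum(frame);
    let noise_mag: Vec<f32> = to_spectrum(noise_frame).iter().map(|c| c.norm()).collect();

    // Subtract the noise magnitude, floor at zero, keep the original phase.
    for (bin, &nm) in spec.iter_mut().zip(noise_mag.iter()) {
        let mag = (bin.norm() - nm).max(0.0);
        let phase = bin.arg();
        *bin = Complex::from_polar(mag, phase);
    }

    ifft.process(&mut spec);
    // rustfft's inverse transform is unnormalized, so divide by n.
    spec.iter().map(|c| c.re / n as f32).collect()
}
```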

@EzraEllette
Contributor

I'm implementing dynamic range compression to see if that helps.
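
As a sketch, the compressor is essentially this kind of hard-knee gain reduction (illustrative values, not the exact implementation):

```rust
/// Very simple hard-knee compressor: samples above `threshold` (linear, 0..1)
/// are attenuated by `ratio`, which evens out loud and quiet passages.
fn compress_dynamic_range(samples: &mut [f32], threshold: f32, ratio: f32) {
    for s in samples.iter_mut() {
        let level = s.abs();
        if level > threshold {
            let compressed = threshold + (level - threshold) / ratio;
            *s = s.signum() * compressed;
        }
    }
}

// Example: compress_dynamic_range(&mut samples, 0.5, 4.0);
```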

@EzraEllette
Contributor

not seeing a difference

@NicodemPL

Strange. Usually I get very good results with whisper large, but that's when working with long files (15min+).
Have you tried exceeding the 30s time window with a 2-second overlap?

@EzraEllette
Contributor

EzraEllette commented Oct 10, 2024

There are errors in the middle of the transcripts, so I am focusing on those through audio preprocessing.

@EzraEllette
Contributor

I should mention that I changed the sinc interpolation to cubic, which is drastically slower than linear. I updated my PR to reflect that.
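
For context, the change is roughly this, assuming the resampler is rubato (struct and field names below follow the rubato ~0.15 API; rates and sizes are illustrative):

```rust
use rubato::{
    Resampler, SincFixedIn, SincInterpolationParameters, SincInterpolationType,
    WindowFunction,
};

fn make_resampler(from_hz: f64, to_hz: f64) -> SincFixedIn<f32> {
    let params = SincInterpolationParameters {
        sinc_len: 256,
        f_cutoff: 0.95,
        oversampling_factor: 256,
        // Cubic is noticeably slower than Linear but interpolates more accurately.
        interpolation: SincInterpolationType::Cubic,
        window: WindowFunction::BlackmanHarris2,
    };
    // ratio, max relative ratio, params, chunk size (frames), channels
    SincFixedIn::<f32>::new(to_hz / from_hz, 2.0, params, 1024, 1).unwrap()
}
```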

I'm trying some other sampling changes but I'm doubtful that it will improve anything.

@EzraEllette
Contributor

Deepgram result:
[image]
[image]
At least we beat deepgram on the last sample 😆

@EzraEllette
Contributor

It's worth mentioning that the Levenshtein distance would be lower if we sanitized the transcription output to remove the hallucinations and timestamps.

I think we can assume that if a transcript has two segments with the same timestamp, the shorter segment should be removed. Other than that, I'm not sure what you want to do with the timestamps.
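
A sketch of that rule, with a hypothetical segment type (field names are made up):

```rust
use std::collections::HashMap;

/// Hypothetical transcript segment with a start timestamp in milliseconds.
struct Segment {
    start_ms: u64,
    text: String,
}

/// If two segments share the same timestamp, keep only the longer one.
fn dedupe_segments(segments: Vec<Segment>) -> Vec<Segment> {
    let mut best: HashMap<u64, Segment> = HashMap::new();
    for seg in segments {
        let keep_existing = best
            .get(&seg.start_ms)
            .map_or(false, |kept| kept.text.len() >= seg.text.len());
        if !keep_existing {
            best.insert(seg.start_ms, seg);
        }
    }
    let mut out: Vec<Segment> = best.into_values().collect();
    out.sort_by_key(|s| s.start_ms);
    out
}
```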

@louis030195
Collaborator Author

i think one of the common issues with screenpipe is when someone speaks, then stops, then starts again within a 30s chunk; whisper will hallucinate "Thank you" in the silences. that's one thing we should solve somehow, through some audio processing hacks i guess
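
one cheap mitigation could be to drop segments whose text is a known silence hallucination when VAD saw (almost) no speech in that range, something like this (just a sketch, the threshold and phrase list are guesses):

```rust
/// Phrases whisper tends to hallucinate on silence or music.
const SILENCE_HALLUCINATIONS: &[&str] = &["thank you", "thanks for watching", "you"];

/// Drop a segment if VAD saw (almost) no speech in its time range and the
/// text matches a known hallucination. `speech_ratio` is the fraction of
/// VAD-positive frames in that range, computed by the caller.
fn is_silence_hallucination(text: &str, speech_ratio: f32) -> bool {
    let cleaned = text
        .trim()
        .trim_matches(|c: char| c.is_ascii_punctuation())
        .to_lowercase();
    speech_ratio < 0.1 && SILENCE_HALLUCINATIONS.contains(&cleaned.as_str())
}
```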

regarding the current accuracy metrics, i think we could have a second unit test that contains audio recordings from screenpipe, like 4 of them, and either write the expected transcripts manually or use some online transcription service to create them (which makes mistakes). for the current unit test, some of the expected transcripts were done with deepgram, which makes a few mistakes too

honestly, even as a human i sometimes struggle to transcribe some of the audio recordings when people have strong accents

also something else we could eventually do is fix the transcripts with an LLM in real time, but i'd expect it to be hard to do well, since it shouldn't take more than 1gb of memory, shouldn't add hallucinations, shouldn't overload the GPU/CPU, etc.

another reason i wanted to switch to whisper cpp is that they have more features, like initial prompt:

ggerganov/whisper.cpp#348

https://github.com/thewh1teagle/vibe/blob/28b17d2dd9f1ffea148731be3e12d7a4efd433f4/core/src/transcribe.rs#L114

which we could put in a screenpipe cli arg and app ui settings, like "yo my name is louis, sometimes i talk about screenpipe, i have a french accent so make sure to take this into account ..."
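
if we go through a rust binding like whisper-rs, passing that prompt could look roughly like this (a sketch; i'm assuming the whisper-rs API here, double check the exact names):

```rust
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

/// Transcribe a mono 16kHz f32 buffer, biasing decoding with a user prompt.
fn transcribe_with_prompt(model_path: &str, samples: &[f32], prompt: &str) -> String {
    let ctx = WhisperContext::new_with_params(model_path, WhisperContextParameters::default())
        .expect("failed to load model");
    let mut state = ctx.create_state().expect("failed to create state");

    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    // User-provided context, e.g. names, product words, accent hints.
    params.set_initial_prompt(prompt);

    state.full(params, samples).expect("transcription failed");

    let n = state.full_n_segments().expect("failed to get segment count");
    (0..n)
        .map(|i| state.full_get_segment_text(i).expect("failed to get text"))
        .collect()
}
```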

while candle is really barebones, we have to reimplement everything ourselves sadly, and we don't have time to turn into AI researchers at this point

i guess diarization would also improve accuracy a little by running transcription only on frames that belong to a specific voice

some rough thoughts, what do you think are the next steps @EzraEllette ?

@EzraEllette
Contributor

@louis030195 I have a couple meetings tonight but I'll give you some information afterwards.

Right now it makes more sense to use tools that have more features and are actively maintained by other developers when possible.

I'll contact you once my meetings are finished.

@louis030195
Collaborator Author

@EzraEllette

do you want to refactor to always record audio + send chunks for transcription?

also interested in whether there could be a way to stream audio + transcription through the API, for extension reasons

also the #170 use case is important

@EzraEllette
Contributor

@EzraEllette

do you want to refactor to always record audio + send chunks for transcription?

also interested in whether there could be a way to stream audio + transcription through the API, for extension reasons

also the #170 use case is important

Yes. I want to make that refactor and explore streaming.

@louis030195
Collaborator Author

louis030195 commented Oct 15, 2024

adding some context

some user feedback:
[image]

some users had issues with language, e.g. #451, but i think #469 would solve it?

diarization: https://github.com/thewh1teagle/pyannote-rs - can probably slightly increase accuracy too

other issues with audio:

[image]

on my side i want to prioritize having high-quality data infrastructure for audio that ideally works across OSes (macOS and Windows at least); UI things are less of a priority

@EzraEllette
Contributor

Speaker Identification and Diarization will be a large undertaking.

Chunking the audio and overlapping is working for now.

Here are some of my thoughts about streaming audio data:

  • Refactoring audio recording to stream the audio data will enable other parts of screenpipe to ingest audio recording data.

  • in addition to streaming, instead of cutting off the audio at X seconds, wait (using VAD) for a pause after X seconds before starting a new chunk. That way transcriptions won't end in the middle of a sentence. Overlap is effective, but it may not be necessary with streaming and using VAD to find a good break in the audio.
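
A sketch of that chunking rule (the VAD call itself is abstracted behind the `is_speech` flag the caller computes):

```rust
/// Cuts chunks after `min_secs` of audio, but only once VAD reports
/// `silence_frames_to_cut` consecutive non-speech frames, so a chunk
/// never ends mid-sentence.
struct PauseAwareChunker {
    min_samples: usize,
    silence_frames_to_cut: usize,
    silent_run: usize,
    buffer: Vec<f32>,
}

impl PauseAwareChunker {
    fn new(sample_rate: usize, min_secs: usize, silence_frames_to_cut: usize) -> Self {
        Self {
            min_samples: sample_rate * min_secs,
            silence_frames_to_cut,
            silent_run: 0,
            buffer: Vec::new(),
        }
    }

    /// Push one VAD-sized frame; returns a finished chunk when it's time to cut.
    fn push_frame(&mut self, frame: &[f32], is_speech: bool) -> Option<Vec<f32>> {
        self.buffer.extend_from_slice(frame);
        self.silent_run = if is_speech { 0 } else { self.silent_run + 1 };

        let long_enough = self.buffer.len() >= self.min_samples;
        let at_pause = self.silent_run >= self.silence_frames_to_cut;
        if long_enough && at_pause {
            self.silent_run = 0;
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}
```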

@louis030195
Collaborator Author

Speaker Identification and Diarization will be a large undertaking.

Chunking the audio and overlapping is working for now.

Here are some of my thoughts about streaming audio data:

  • Refactoring audio recording to stream the audio data will enable other parts of screenpipe to ingest audio recording data.
  • in addition to streaming, instead of cutting off the audio at X seconds, wait (using VAD) for a pause after X seconds before starting a new chunk. That way transcriptions won't end in the middle of a sentence. Overlap is effective, but it may not be necessary with streaming and using VAD to find a good break in the audio.

agree, let's not do speaker identification and diarization for now

agree with streaming
