
$300 - improve local transcription accuracy #431

Open
louis030195 opened this issue Oct 4, 2024 · 28 comments

@louis030195
Collaborator

louis030195 commented Oct 4, 2024

how we record & transcribe now:

  1. record a 30s chunk of audio on each device
  2. run a local voice activity detection (VAD) model to extract speech frames; if there aren't enough, skip transcription
  3. transcribe the speech frames
  4. encode the audio to mp4
  5. save the transcription + mp4 source to the db

definition of done:

  • audio transcription accuracy is measured using best practices & a benchmark, e.g. we can say "audio transcription is 78% accurate" by running a benchmark command (usually you have an mp4 with some voice and know the transcript beforehand, then you check how the model performs compared to the expected output, do some math, and it gives you a %); see the sketch after this list
  • audio transcription accuracy is higher than now (to put a number on it, say ~20% better)
  • the rest of the program is unchanged or only minimally impacted in terms of stability, resource usage, etc., and still works on macOS and Windows, and preferably Linux
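
here's a rough sketch of what that benchmark could compute (not final code); accuracy here is 1 minus the word error rate, using a word-level Levenshtein distance against a reference transcript:

```rust
// rough sketch of a word-accuracy benchmark (not final code).
// in the real benchmark, `actual` would come from running the full
// pipeline (VAD + whisper) on an audio file with a known transcript.

/// word-level Levenshtein distance between reference and hypothesis.
fn word_edit_distance(reference: &str, hypothesis: &str) -> usize {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    let mut dp = vec![vec![0usize; h.len() + 1]; r.len() + 1];
    for i in 0..=r.len() { dp[i][0] = i; }
    for j in 0..=h.len() { dp[0][j] = j; }
    for i in 1..=r.len() {
        for j in 1..=h.len() {
            let cost = if r[i - 1].eq_ignore_ascii_case(h[j - 1]) { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1)
                .min(dp[i][j - 1] + 1)
                .min(dp[i - 1][j - 1] + cost);
        }
    }
    dp[r.len()][h.len()]
}

/// accuracy = (1 - word error rate), clamped to 0, as a percentage.
fn accuracy_percent(reference: &str, hypothesis: &str) -> f64 {
    let words = reference.split_whitespace().count().max(1);
    let wer = word_edit_distance(reference, hypothesis) as f64 / words as f64;
    (1.0 - wer).max(0.0) * 100.0
}

fn main() {
    let expected = "hello and welcome to the meeting";
    let actual = "hello welcome to the meeting";
    println!("accuracy: {:.1}%", accuracy_percent(expected, actual));
}
```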

possible ways to increase accuracy:

  • overlap audio chunks (see the sketch after this list)
  • change/tweak duration
  • tweak VAD settings
  • tweak sample rate things
  • change/add new model
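
for the overlap idea, something like this could work (just a sketch, the 2s figure is only an example, not a measured value):

```rust
/// Keeps the tail of each audio chunk so the next chunk is transcribed
/// with some context, reducing words cut in half at chunk boundaries.
struct OverlappingChunker {
    sample_rate: usize,
    overlap_secs: f32,
    tail: Vec<f32>,
}

impl OverlappingChunker {
    fn new(sample_rate: usize, overlap_secs: f32) -> Self {
        Self { sample_rate, overlap_secs, tail: Vec::new() }
    }

    /// Returns the chunk to transcribe: previous tail + new samples.
    fn next_chunk(&mut self, new_samples: &[f32]) -> Vec<f32> {
        let mut chunk = Vec::with_capacity(self.tail.len() + new_samples.len());
        chunk.extend_from_slice(&self.tail);
        chunk.extend_from_slice(new_samples);

        // Remember the last `overlap_secs` of audio for the next call.
        let overlap_len = (self.sample_rate as f32 * self.overlap_secs) as usize;
        let start = new_samples.len().saturating_sub(overlap_len);
        self.tail = new_samples[start..].to_vec();
        chunk
    }
}
```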

make sure to measure first, then optimise second, not the other way around, no "it looks better after my change", i only trust numbers, thank you

/bounty 300

@louis030195 louis030195 added the bug Something isn't working label Oct 4, 2024

linear bot commented Oct 4, 2024


algora-pbc bot commented Oct 4, 2024

💎 $300 bounty • Screenpi.pe

Steps to solve:

  1. Start working: Comment /attempt #431 with your implementation plan
  2. Submit work: Create a pull request including /claim #431 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to mediar-ai/screenpipe!


| Attempt | Started (GMT+0) | Solution |
| --- | --- | --- |
| 🔴 @EzraEllette | Oct 7, 2024, 4:49:13 AM | WIP |

@louis030195 louis030195 removed the bug Something isn't working label Oct 4, 2024
@TanGentleman
Contributor

I've heard that a well-implemented version of the new Whisper Turbo is very well optimized and significantly faster than realtime, definitely so on my hardware (an M1 Mac).
openai/whisper#2363
https://www.markhneedham.com/blog/2024/10/02/insanely-fast-whisper-running-openai-whisper-turbo-mac/

@louis030195
Collaborator Author

I've heard that a well implemented form of the new Whisper Turbo is very well optimized and significantly faster than realtime, definitely so on my hardware (an M1 Mac). openai/whisper#2363 https://www.markhneedham.com/blog/2024/10/02/insanely-fast-whisper-running-openai-whisper-turbo-mac/

#413 yes, but it's quality i'm thinking of here more than speed

@NicodemPL

Certainly, we should prioritize quality. With the same Whisper model, I noticed a notable difference in output quality between processing a full recorded file and the current implementation, which works on small chunks.
I've done some digging, and it seems that a 2-second overlap plus a check for repeated words before finalizing the transcript should do the job.
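
Something along these lines, just as a sketch (the function name is made up):

```rust
/// Drops words from the start of `next` that duplicate the end of `prev`,
/// which happens when overlapping chunks transcribe the same speech twice.
fn merge_overlapping_transcripts(prev: &str, next: &str) -> String {
    let prev_words: Vec<&str> = prev.split_whitespace().collect();
    let next_words: Vec<&str> = next.split_whitespace().collect();
    let max_overlap = prev_words.len().min(next_words.len());

    // Find the longest suffix of `prev` that equals a prefix of `next`.
    let mut overlap = 0;
    for n in (1..=max_overlap).rev() {
        let suffix = &prev_words[prev_words.len() - n..];
        let prefix = &next_words[..n];
        if suffix.iter().zip(prefix.iter()).all(|(a, b)| a.eq_ignore_ascii_case(b)) {
            overlap = n;
            break;
        }
    }
    format!("{} {}", prev, next_words[overlap..].join(" ")).trim().to_string()
}
```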

@EzraEllette
Contributor

EzraEllette commented Oct 7, 2024

/attempt #431


@mlasy

mlasy commented Oct 8, 2024

i noticed that the existing whisper-large model only works with English. Could we support different languages or different whisper models as well?

@louis030195
Collaborator Author


any update on this?

i want to add diarization using https://github.com/thewh1teagle/pyannote-rs but it might overlap with this issue

@EzraEllette
Contributor

Diarization is also something I want from screenpipe, as well as speaker verification. I meant to cancel my attempt on this since I haven't had the time to work on it, but algora's cancel button doesn't work.

@louis030195
Collaborator Author

Oh okay, I'll have a look at this issue + diarization, etc then

@EzraEllette
Contributor

@louis030195 Are you actively working on this? I will have some time to get started tonight.

@louis030195
Collaborator Author

@EzraEllette a bit, i did a simple unit test to measure accuracy on short wav files but it doesn't fully reflect real screenpipe usage

https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/tests/accuracy_test.rs

i improved accuracy a bit using audio normalization, from 57% to 62% (while deepgram is at 82%), using whisper turbo
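
for reference, the kind of normalization i mean is roughly this (a sketch of peak normalization, not necessarily the exact variant used):

```rust
/// Peak-normalize a mono f32 buffer so the loudest sample hits `target_peak`
/// (e.g. 0.95). Quiet speech gets boosted before it reaches whisper.
fn normalize_peak(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak > f32::EPSILON {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}
```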

been thinking of switching from candle to whisper-cpp (bindings) and adding diarization etc. by just copy-pasting the code here https://github.com/thewh1teagle/vibe/blob/main/core/src/transcribe.rs

something else we'd like to do in the future is to stream transcriptions through a websocket in the server, for example, for different use cases; this might affect the architecture, but i think the main priority rn is to improve quality
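
for the websocket thing, the server side could look roughly like this if we expose a /transcriptions/ws endpoint on the axum server (endpoint name and channel wiring are assumptions, just a sketch):

```rust
use axum::{
    extract::ws::{Message, WebSocket, WebSocketUpgrade},
    response::IntoResponse,
    routing::get,
    Router,
};
use tokio::sync::broadcast;

/// Upgrade the HTTP request and forward every finished transcription to the client.
async fn ws_transcriptions(
    ws: WebSocketUpgrade,
    tx: broadcast::Sender<String>,
) -> impl IntoResponse {
    let rx = tx.subscribe();
    ws.on_upgrade(move |socket| stream_transcriptions(socket, rx))
}

async fn stream_transcriptions(mut socket: WebSocket, mut rx: broadcast::Receiver<String>) {
    // Each transcribed chunk published on the broadcast channel is pushed
    // to the connected client as a text frame.
    while let Ok(text) = rx.recv().await {
        if socket.send(Message::Text(text.into())).await.is_err() {
            break; // client disconnected
        }
    }
}

fn router(tx: broadcast::Sender<String>) -> Router {
    Router::new().route(
        "/transcriptions/ws",
        get(move |ws: WebSocketUpgrade| ws_transcriptions(ws, tx.clone())),
    )
}
```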

i won't have much time left today, and tomorrow morning i've got a bunch of calls, so feel free to try things

@EzraEllette
Contributor

I have been meaning to make a pipe, but I need realtime data.

I'll pull what you have and see what I can do. Also, I'm going to get a baseline on whisper-cpp's accuracy and go from there.

@EzraEllette
Contributor

@louis030195 I'm seeing ~5% improvement with spectral subtraction using the last 100ms frame of unknown status. I might try using the last few hundred ms rather than just one frame, but for now, here it is:
[image]
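
Roughly what I mean by spectral subtraction, as a simplified sketch using the rustfft crate (not the exact code from my branch):

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// Subtract an estimated noise magnitude spectrum from one audio frame.
/// `noise_frame` is a frame assumed to contain no speech (e.g. the trailing
/// ~100ms of the previous chunk); both frames must have the same length.
fn spectral_subtract(frame: &[f32], noise_frame: &[f32]) -> Vec<f32> {
    assert_eq!(frame.len(), noise_frame.len());
    let n = frame.len();
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(n);
    let ifft = planner.plan_fft_inverse(n);

    let to_spectrum = |signal: &[f32]| {
        let mut buf: Vec<Complex<f32>> =
            signal.iter().map(|&s| Complex::new(s, 0.0)).collect();
        fft.process(&mut buf);
        buf
    };

    let mut spec = to_spectrum(frame);
    let noise_mag: Vec<f32> = to_spectrum(noise_frame).iter().map(|c| c.norm()).collect();

    // Subtract the noise magnitude, floor at zero, keep the original phase.
    for (bin, &nm) in spec.iter_mut().zip(noise_mag.iter()) {
        let mag = (bin.norm() - nm).max(0.0);
        let phase = bin.arg();
        *bin = Complex::from_polar(mag, phase);
    }

    ifft.process(&mut spec);
    // rustfft's inverse transform is unnormalized, so divide by n.
    spec.iter().map(|c| c.re / n as f32).collect()
}
```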

@EzraEllette
Contributor

I'm implementing dynamic range compression to see if that helps.
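
As a sketch, the compressor is essentially this kind of hard-knee gain reduction (illustrative values, not the exact implementation):

```rust
/// Very simple hard-knee compressor: samples above `threshold` (linear, 0..1)
/// are attenuated by `ratio`, which evens out loud and quiet passages.
fn compress_dynamic_range(samples: &mut [f32], threshold: f32, ratio: f32) {
    for s in samples.iter_mut() {
        let level = s.abs();
        if level > threshold {
            let compressed = threshold + (level - threshold) / ratio;
            *s = s.signum() * compressed;
        }
    }
}

// Example: compress_dynamic_range(&mut samples, 0.5, 4.0);
```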

@EzraEllette
Contributor

not seeing a difference

@NicodemPL

Strange. Usually I get very good results with whisper large, but that's when working with long files (15min+).
Have you tried exceeding the 30s time window with a 2-second overlap?

@EzraEllette
Contributor

EzraEllette commented Oct 10, 2024

There are errors in the middle of the transcripts, so I am focusing on those through audio preprocessing.

@EzraEllette
Contributor

I should mention that I changed the sinc interpolation to cubic, which is drastically slower than linear. I updated my PR to reflect that.
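
For context, the change is roughly this, assuming the resampler is rubato (struct and field names below follow the rubato ~0.15 API; rates and sizes are illustrative):

```rust
use rubato::{
    Resampler, SincFixedIn, SincInterpolationParameters, SincInterpolationType,
    WindowFunction,
};

fn make_resampler(from_hz: f64, to_hz: f64) -> SincFixedIn<f32> {
    let params = SincInterpolationParameters {
        sinc_len: 256,
        f_cutoff: 0.95,
        oversampling_factor: 256,
        // Cubic is noticeably slower than Linear but interpolates more accurately.
        interpolation: SincInterpolationType::Cubic,
        window: WindowFunction::BlackmanHarris2,
    };
    // ratio, max relative ratio, params, chunk size (frames), channels
    SincFixedIn::<f32>::new(to_hz / from_hz, 2.0, params, 1024, 1).unwrap()
}
```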

I'm trying some other sampling changes but I'm doubtful that it will improve anything.

@EzraEllette
Contributor

Deepgram result:
[image]
[image]
At least we beat deepgram on the last sample 😆

@EzraEllette
Contributor

It's worth mentioning that the Levenshtein distance would be lower if we sanitized the transcription output to remove the hallucinations and timestamps.

I think we can assume that if a transcript has two segments with the same timestamp, the shorter segment should be removed. Other than that, I'm not sure what you want to do with the timestamps.
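
A sketch of that rule, with a hypothetical segment type (field names are made up):

```rust
use std::collections::HashMap;

/// Hypothetical transcript segment with a start timestamp in milliseconds.
struct Segment {
    start_ms: u64,
    text: String,
}

/// If two segments share the same timestamp, keep only the longer one.
fn dedupe_segments(segments: Vec<Segment>) -> Vec<Segment> {
    let mut best: HashMap<u64, Segment> = HashMap::new();
    for seg in segments {
        let keep_existing = best
            .get(&seg.start_ms)
            .map_or(false, |kept| kept.text.len() >= seg.text.len());
        if !keep_existing {
            best.insert(seg.start_ms, seg);
        }
    }
    let mut out: Vec<Segment> = best.into_values().collect();
    out.sort_by_key(|s| s.start_ms);
    out
}
```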

@louis030195
Collaborator Author

i think one of the common issues with screenpipe is when someone speaks, then stops, then starts again within a 30s chunk; whisper will hallucinate "Thank you" in the silences. that's one thing we should solve somehow, through some audio processing hacks i guess
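
one cheap mitigation could be to drop segments whose text is a known silence hallucination when VAD saw (almost) no speech in that range, something like this (just a sketch, the threshold and phrase list are guesses):

```rust
/// Phrases whisper tends to hallucinate on silence or music.
const SILENCE_HALLUCINATIONS: &[&str] = &["thank you", "thanks for watching", "you"];

/// Drop a segment if VAD saw (almost) no speech in its time range and the
/// text matches a known hallucination. `speech_ratio` is the fraction of
/// VAD-positive frames in that range, computed by the caller.
fn is_silence_hallucination(text: &str, speech_ratio: f32) -> bool {
    let cleaned = text
        .trim()
        .trim_matches(|c: char| c.is_ascii_punctuation())
        .to_lowercase();
    speech_ratio < 0.1 && SILENCE_HALLUCINATIONS.contains(&cleaned.as_str())
}
```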

regarding the current accuracy metrics, i think we could have a second unit test that contains audio recordings from screenpipe, like 4 of them, and either write the expected transcripts manually or use some online transcription service to create them (which makes mistakes). for the current unit test, some of the expected transcripts were done with deepgram, which makes a few mistakes too

honestly, even as a human i sometimes struggle to transcribe some of the audio recordings when people have strong accents

also something else we could eventually do is fix the transcripts with an LLM in real time, but i'd expect it to be hard to do well, since it shouldn't take more than 1gb of memory, shouldn't add hallucinations, shouldn't overload the GPU/CPU, etc.

another reason i wanted to switch to whisper cpp is that they have more features, like initial prompt:

ggerganov/whisper.cpp#348

https://github.com/thewh1teagle/vibe/blob/28b17d2dd9f1ffea148731be3e12d7a4efd433f4/core/src/transcribe.rs#L114

which we could put in a screenpipe cli arg and app ui settings, like "yo my name is louis, sometimes i talk about screenpipe, i have a french accent so make sure to take this into account ..."
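
if we go through a rust binding like whisper-rs, passing that prompt could look roughly like this (a sketch; i'm assuming the whisper-rs API here, double check the exact names):

```rust
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

/// Transcribe a mono 16kHz f32 buffer, biasing decoding with a user prompt.
fn transcribe_with_prompt(model_path: &str, samples: &[f32], prompt: &str) -> String {
    let ctx = WhisperContext::new_with_params(model_path, WhisperContextParameters::default())
        .expect("failed to load model");
    let mut state = ctx.create_state().expect("failed to create state");

    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    // User-provided context, e.g. names, product words, accent hints.
    params.set_initial_prompt(prompt);

    state.full(params, samples).expect("transcription failed");

    let n = state.full_n_segments().expect("failed to get segment count");
    (0..n)
        .map(|i| state.full_get_segment_text(i).expect("failed to get text"))
        .collect()
}
```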

while candle is really barebones, we have to reimplement everything ourselves sadly, and we don't have time to turn into AI researchers at this point

i guess diarization would also improve accuracy a little by running transcription only on frames that belong to a specific voice

some rough thoughts, what do you think are the next steps @EzraEllette ?

@EzraEllette
Contributor

@louis030195 I have a couple meetings tonight but I'll give you some information afterwards.

Right now it makes more sense to use tools that have more features and are actively maintained by other developers when possible.

I'll contact you once my meetings are finished.

@louis030195
Collaborator Author

@EzraEllette

do you want to refactor to always record audio + send chunks for transcription?

also interested in whether there could be a way to stream audio + transcription through the API, for extension reasons

also the #170 use case is important

@EzraEllette
Contributor

@EzraEllette

do you want to refactor to always record audio + send chunks for transcription?

also interested in whether there could be a way to stream audio + transcription through the API, for extension reasons

also the #170 use case is important

Yes. I want to make that refactor and explore streaming.

@louis030195
Collaborator Author

louis030195 commented Oct 15, 2024

adding some context

some user feedback:
[image]

some users had issues with language, e.g. #451, but i think #469 would solve it?

diarization: https://github.com/thewh1teagle/pyannote-rs - can probably slightly increase accuracy too

other issues with audio:

[image]

on my side i want to prioritize having high-quality data infrastructure for audio that ideally works across OSes (macOS and Windows at least); UI things are less of a priority

@EzraEllette
Contributor

Speaker Identification and Diarization will be a large undertaking.

Chunking the audio and overlapping is working for now.

Here are some of my thoughts about streaming audio data:

  • Refactoring audio recording to stream the audio data will enable other parts of screenpipe to ingest audio recording data.

  • in addition to streaming, instead of cutting off the audio at X seconds, wait (using VAD) for a pause after X seconds before starting a new chunk. That way transcriptions won't end in the middle of a sentence. Overlap is effective, but it may not be necessary with streaming and using VAD to find a good break in the audio.
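
A sketch of that chunking rule (the VAD call itself is abstracted behind the `is_speech` flag the caller computes):

```rust
/// Cuts chunks after `min_secs` of audio, but only once VAD reports
/// `silence_frames_to_cut` consecutive non-speech frames, so a chunk
/// never ends mid-sentence.
struct PauseAwareChunker {
    min_samples: usize,
    silence_frames_to_cut: usize,
    silent_run: usize,
    buffer: Vec<f32>,
}

impl PauseAwareChunker {
    fn new(sample_rate: usize, min_secs: usize, silence_frames_to_cut: usize) -> Self {
        Self {
            min_samples: sample_rate * min_secs,
            silence_frames_to_cut,
            silent_run: 0,
            buffer: Vec::new(),
        }
    }

    /// Push one VAD-sized frame; returns a finished chunk when it's time to cut.
    fn push_frame(&mut self, frame: &[f32], is_speech: bool) -> Option<Vec<f32>> {
        self.buffer.extend_from_slice(frame);
        self.silent_run = if is_speech { 0 } else { self.silent_run + 1 };

        let long_enough = self.buffer.len() >= self.min_samples;
        let at_pause = self.silent_run >= self.silence_frames_to_cut;
        if long_enough && at_pause {
            self.silent_run = 0;
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}
```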

@louis030195
Collaborator Author

Speaker Identification and Diarization will be a large undertaking.

Chunking the audio and overlapping is working for now.

Here are some of my thoughts about streaming audio data:

  • Refactoring audio recording to stream the audio data will enable other parts of screenpipe to ingest audio recording data.
  • in addition to streaming, instead of cutting off the audio at X seconds, wait (using VAD) for a pause after X seconds before starting a new chunk. That way transcriptions won't end in the middle of a sentence. Overlap is effective, but it may not be necessary with streaming and using VAD to find a good break in the audio.

agree, let's not do speaker identification and diarization for now

agree with streaming
