Consider switching to Whisper models #75
Comments
Some libraries that can help: https://github.com/vilassn/whisper_android
I've tried that before; it wasn't fast enough. Maybe I'll try it again. ggerganov/whisper.cpp#1070
I just found Sayboard while searching for open source voice recognition. I liked two aspects of it: privacy (being able to run the voice recognition offline), and fully open source models with their data sources cited. My biggest issue is that I do not trust OpenAI enough to send any information through OpenAI APIs. It's likely that my data would be used in further training sets without my consent, and I do not want private texts or confidential work emails ending up in an OpenAI dataset. If Sayboard does decide to switch to OpenAI Whisper, please provide a way to disable it and keep the default on-device Vosk models. Sayboard recognizes my voice just fine, and I would rather stick with a more transparent, fully open source solution.
As I have stated in the readme, Sayboard will not send your data over the internet. This issue is about using the open-source Whisper model offline, on the device. Anyway, since Whisper is quite heavy and does not run in real time, it will be in addition to the current Vosk models.
Hi, are you considering an option to fully switch over to Whisper instead of Vosk, or are you planning some way to have both?
Both |
Interesting. Any ideas on how to utilize both at the same time? Maybe replacing the Vosk transcript with the Whisper transcript once it's computed, while keeping any user-added punctuation?
Thank you for the reassurance! I appreciate your commitment to not sending data off-device.
I see! I'm still digging into the papers for OpenAI Whisper to see if they fully cite their data sources like Vosk does. I am still uncomfortable with using on-device models that don't have transparent data sources. Are you planning on allowing the user to disable the Whisper model and just use Vosk? That would be appreciated! And thanks for all your work on Sayboard. My back and the nerves in my hands are healing after a car accident, and I had to limit my phone typing until I found Sayboard. So I deeply appreciate your efforts to provide a privacy-respecting, accessible keyboard.
Whisper doesn't cite its data sources, but I believe it's mostly audio-and-transcript pairs from the internet, some of them even from other speech-to-text tools. That's the reason it's so good: its dataset has a lot of variation and sheer volume.
That's an interesting idea. I was planning on implementing it similarly to how the models for the different languages are implemented - so that one can scroll through the models using the globe key, where some are Vosk and some are Whisper.
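For illustration, here is a minimal sketch of what such an engine-agnostic model list might look like; the class and function names are hypothetical, not Sayboard's actual code:

```kotlin
// Hypothetical engine-agnostic model entries; none of these names come from Sayboard itself.
sealed interface RecognizerModel {
    val displayName: String
}

data class VoskModel(override val displayName: String, val modelDir: String) : RecognizerModel
data class WhisperModel(override val displayName: String, val ggmlFile: String) : RecognizerModel

// The globe key would simply cycle through whatever models the user has installed,
// regardless of which engine backs each one.
fun nextModelIndex(models: List<RecognizerModel>, current: Int): Int =
    if (models.isEmpty()) 0 else (current + 1) % models.size
```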
Ah, I see! Thanks, but I think that may not work: doesn't Vosk often mistake one word for two or more words? If so, it wouldn't work without something like an LLM (or SLM) processing the punctuated Vosk output to determine where the manually added punctuation should go. Another approach is to just let Whisper figure out all the punctuation itself and replace everything, but then the Vosk output serves no purpose, since it gets completely replaced anyway. Maybe the Vosk and Whisper outputs can be compared instead: since they should contain roughly the same words, Whisper can replace the Vosk output wherever the two differ in words, while the punctuation is kept, with an optional toggle for letting Whisper replace the punctuation as well.
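To make the comparison idea concrete, here is a minimal sketch of such a merge, assuming both engines return plain text. It aligns the two transcripts word by word purely by index (a real implementation would need a proper word-level diff, precisely because Vosk can split one word into several), takes Whisper's word wherever the engines disagree, and keeps whatever punctuation is attached to the Vosk token:

```kotlin
// Minimal merge sketch: naive index-based word alignment, not a real diff.
fun mergeTranscripts(vosk: String, whisper: String): String {
    val voskTokens = vosk.split(" ").filter { it.isNotBlank() }
    val whisperWords = whisper.split(" ").filter { it.isNotBlank() }
        .map { it.trimEnd { c -> !c.isLetterOrDigit() } }

    return voskTokens.mapIndexed { i, voskToken ->
        // Punctuation attached to the end of this Vosk token (user-added or engine-added).
        val punctuation = voskToken.takeLastWhile { !it.isLetterOrDigit() }
        val voskWord = voskToken.dropLast(punctuation.length)
        val whisperWord = whisperWords.getOrNull(i) ?: voskWord
        // Prefer Whisper's word when the engines disagree, but keep the punctuation.
        val word = if (whisperWord.equals(voskWord, ignoreCase = true)) voskWord else whisperWord
        word + punctuation
    }.joinToString(" ")
}
```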
It should be noted that FUTO uses OpenAI Whisper models fine-tuned with the ACFT method. (https://keyboard.futo.org/voice-input-models)
I have not had success with Whisper on low-power devices like phones. The small Whisper models perform very poorly, even with a bunch of preprocessing on the audio streams.
Have you tried FUTO Voice Input or my app Transcribro? Whisper tiny.en works decently and is pretty fast when run with whisper.cpp.
@soupslurpr I'll try it out, ty :) |
https://github.com/niedev/RTranslator |
Problem
Currently, at least in my experience, it is rare for the app to correctly recognize most words on the first try, even under noise-free conditions. The subsequent cleanup of the text can best be described as tedious. Thus, any significant improvement to the underlying STT model would be of great use.
Proposed solution
OpenAI Whisper is an open-source/open-weights, transformer-based, multilingual STT and any-to-English speech translation model. In my experience, even the smaller Whisper models perform close to Google's STT API, with the larger ones clearly superior in all my tests.
Currently, one of the best inference implementations is whisper.cpp, which also does not have too many dependencies. An alternative option is TensorFlow, which can make better use of some TPUs, though it may come with a bit more overhead on the CPU. As far as I know, faster-whisper, while offering better performance, depends on Python, though I might be wrong.
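As a rough illustration of wiring whisper.cpp into an Android app, a hypothetical JNI bridge in Kotlin could look like the sketch below; the library name and native method signatures are placeholders for a hand-written wrapper around whisper.cpp's C API, not an existing binding:

```kotlin
// Hypothetical JNI bridge to a hand-written native wrapper around whisper.cpp.
// None of these names come from an existing library.
object WhisperBridge {
    init {
        System.loadLibrary("whisper_jni") // assumed name of the bundled native library
    }

    // Loads a ggml model file and returns an opaque pointer to the native context.
    external fun initContext(modelPath: String): Long

    // Runs full (non-streaming) transcription on 16 kHz mono PCM float samples.
    external fun transcribe(ctx: Long, samples: FloatArray): String

    // Releases the native context.
    external fun freeContext(ctx: Long)
}
```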
Prior art
Considerations
Whisper is a transformer model and as such relies on the CPU-heavy attention mechanism. The naive self-attention implementation has O(n²·d + n·d²) computational complexity and O(n²) memory usage. While modern inference implementations use optimized techniques that somewhat reduce this, inference still slows down significantly as the context grows.
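As a rough worked example (assuming Whisper tiny's model width of d = 384 and the full 30-second encoder window of n = 1500 positions), the two terms come out to roughly

```latex
n^2 d \approx 1500^2 \cdot 384 \approx 8.6 \times 10^{8}, \qquad
n d^2 \approx 1500 \cdot 384^2 \approx 2.2 \times 10^{8}
```

multiply-accumulates per self-attention layer, before accounting for the number of layers, the feed-forward blocks, and the decoder's autoregressive steps.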
While the large versions do manage to discern speech even in fairly extreme noise and across a wide range of accents, the smaller ones are more prone to certain failure modes, such as confusing homophones or hallucinating phrases, so choosing a model size is necessarily a balancing act. Real-time streaming seems quite far out of reach.
Add to that some encoding latency (which, UX-wise, may be the more critical part, given that decoding does support streaming), beam search, and other sources of delay, and it may become problematic on older devices.
If anyone reading this issue can suggest a performant, simple, and high-quality STT model, preferably of a more parallelizable architecture, please do share it in the replies.
Another possible mitigation is an option to use an external server for inference. Most modern CPUs can already complete both encoding and decoding in acceptable time, especially for shorter snippets. Using OpenAI's API schema would also give users a cheap managed hosting option, without having to configure their own server and expose it to the WAN via Tailscale, a private VPN, or port forwarding. Unless the user has no stable network connectivity, this option would likely be preferable, though a local fallback would be much appreciated.
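For illustration, here is a minimal Kotlin/OkHttp sketch of such a client, posting a recorded clip to any server that implements the OpenAI-style /v1/audio/transcriptions endpoint; the base URL, API key, and model name would be user-supplied settings rather than anything Sayboard currently defines:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import java.io.File

// Sends a recorded clip to an OpenAI-compatible transcription endpoint and
// returns the plain-text transcript.
fun transcribeRemotely(baseUrl: String, apiKey: String?, audio: File): String {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("model", "whisper-1") // or whatever model the server exposes
        .addFormDataPart("file", audio.name, audio.asRequestBody("audio/wav".toMediaType()))
        .addFormDataPart("response_format", "text")
        .build()

    val request = Request.Builder()
        .url("$baseUrl/v1/audio/transcriptions")
        .apply { if (apiKey != null) header("Authorization", "Bearer $apiKey") }
        .post(body)
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        check(response.isSuccessful) { "Transcription request failed: ${response.code}" }
        return response.body!!.string().trim()
    }
}
```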