Consider switching to Whisper models #75
Comments
Some libraries that can help: https://github.com/vilassn/whisper_android
I've tried that before; it wasn't fast enough. Maybe I'll try it again. ggerganov/whisper.cpp#1070
I just found Sayboard while searching for open source voice recognition. I liked two aspects of it: privacy (being able to run the voice recognition offline), and fully open source models with their data sources cited. My biggest issue is that I do not trust OpenAI enough to send any information through OpenAI APIs. It's likely that my data would be used in further training sets without my consent, and I do not want private texts or confidential work emails ending up in an OpenAI dataset. If Sayboard does decide to switch to OpenAI Whisper, please provide a way to disable it and keep the default on-device Vosk models. Sayboard recognizes my voice just fine, and I would rather stick with a more transparent, fully open source solution.
As I have stated in the readme, Sayboard will not send your data over the internet. This issue is about using the open-source Whisper model offline, on the device. Anyway, since Whisper is quite heavy and does not run in real time, it will be in addition to the current Vosk models.
Hi, are you considering an option to fully switch over to Whisper instead of Vosk, or are you planning some way to have both?
Both |
Interesting. Any ideas on how to utilize both at the same time? Maybe replacing the Vosk transcript with the Whisper transcript once it's computed, while keeping any user-added punctuation?
Thank you for the reassurance! I appreciate your commitment to not sending data off-device.
I see! I'm still digging into the papers for OpenAI Whisper to see if they fully cite their data sources like Vosk does. I am still uncomfortable with using on-device models that don't have transparent data sources. Are you planning on allowing the user to disable the Whisper model and just use Vosk? That would be appreciated! And thanks for all your work on Sayboard. My back and the nerves in my hands are healing after a car accident, and I had to limit my phone typing until I found Sayboard. So I deeply appreciate your efforts to provide a privacy-respecting, accessible keyboard.
Whisper doesn't cite its data sources, but I believe it's mostly audio-and-transcript pairs from the internet, some of them even from other speech-to-text tools. That's the reason it's so good: its dataset has a lot of variation and sheer volume.
That's an interesting idea. I was planning on implementing it similarly to how the models for the different languages are implemented - so that one can scroll through the models using the globe key, where some are Vosk and some are Whisper.
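For illustration, here is a minimal sketch of what such an engine-agnostic model list might look like; the class and function names are hypothetical, not Sayboard's actual code:

```kotlin
// Hypothetical engine-agnostic model entries; none of these names come from Sayboard itself.
sealed interface RecognizerModel {
    val displayName: String
}

data class VoskModel(override val displayName: String, val modelDir: String) : RecognizerModel
data class WhisperModel(override val displayName: String, val ggmlFile: String) : RecognizerModel

// The globe key would simply cycle through whatever models the user has installed,
// regardless of which engine backs each one.
fun nextModelIndex(models: List<RecognizerModel>, current: Int): Int =
    if (models.isEmpty()) 0 else (current + 1) % models.size
```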
Ah, I see! Thanks, but I think that may not work: doesn't Vosk often mistake one word for two or more words? If so, it wouldn't work without something like an LLM (or SLM) processing the punctuated Vosk output to determine where the manually added punctuation should go. Another approach is to just let Whisper figure out all the punctuation itself and replace everything, but then the Vosk output serves no purpose, since it gets completely replaced anyway. Maybe the Vosk and Whisper outputs can be compared instead: since they should contain roughly the same words, Whisper can replace the Vosk output wherever the two differ in words, while the punctuation is kept, with an optional toggle for letting Whisper replace the punctuation as well.
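To make the comparison idea concrete, here is a minimal sketch of such a merge, assuming both engines return plain text. It aligns the two transcripts word by word purely by index (a real implementation would need a proper word-level diff, precisely because Vosk can split one word into several), takes Whisper's word wherever the engines disagree, and keeps whatever punctuation is attached to the Vosk token:

```kotlin
// Minimal merge sketch: naive index-based word alignment, not a real diff.
fun mergeTranscripts(vosk: String, whisper: String): String {
    val voskTokens = vosk.split(" ").filter { it.isNotBlank() }
    val whisperWords = whisper.split(" ").filter { it.isNotBlank() }
        .map { it.trimEnd { c -> !c.isLetterOrDigit() } }

    return voskTokens.mapIndexed { i, voskToken ->
        // Punctuation attached to the end of this Vosk token (user-added or engine-added).
        val punctuation = voskToken.takeLastWhile { !it.isLetterOrDigit() }
        val voskWord = voskToken.dropLast(punctuation.length)
        val whisperWord = whisperWords.getOrNull(i) ?: voskWord
        // Prefer Whisper's word when the engines disagree, but keep the punctuation.
        val word = if (whisperWord.equals(voskWord, ignoreCase = true)) voskWord else whisperWord
        word + punctuation
    }.joinToString(" ")
}
```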
It should be noted that FUTO uses OpenAI Whisper models fine-tuned with the ACFT method. (https://keyboard.futo.org/voice-input-models)
I have not had success with Whisper on low-power devices like phones. The small Whisper models perform very poorly, even with a bunch of preprocessing on the audio streams.
Have you tried FUTO Voice Input or my app Transcribro? Whisper tiny.en works decently and is pretty fast when run with whisper.cpp.
@soupslurpr I'll try it out, ty :) |
https://github.com/niedev/RTranslator |
Problem
Currently, at least in my experience, it is rare for the app to correctly recognize most words on the first try, even under noise-free conditions. The subsequent cleanup of the text can best be described as tedious. Thus, any significant improvement to the underlying STT model would be of great use.
Proposed solution
OpenAI Whisper is an open-source/open-weights, transformer-based, multilingual STT and any-to-English speech translation model. In my experience, even the smaller Whisper models perform close to Google's STT API, with the larger ones clearly superior in all my tests.
Currently, one of the best inference implementations is whisper.cpp, which also does not have too many dependencies. An alternative option is TensorFlow, which can make better use of some TPUs, though it may come with a bit more overhead on the CPU. As far as I know, faster-whisper, while offering better performance, depends on Python, though I might be wrong.
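As a rough illustration of wiring whisper.cpp into an Android app, a hypothetical JNI bridge in Kotlin could look like the sketch below; the library name and native method signatures are placeholders for a hand-written wrapper around whisper.cpp's C API, not an existing binding:

```kotlin
// Hypothetical JNI bridge to a hand-written native wrapper around whisper.cpp.
// None of these names come from an existing library.
object WhisperBridge {
    init {
        System.loadLibrary("whisper_jni") // assumed name of the bundled native library
    }

    // Loads a ggml model file and returns an opaque pointer to the native context.
    external fun initContext(modelPath: String): Long

    // Runs full (non-streaming) transcription on 16 kHz mono PCM float samples.
    external fun transcribe(ctx: Long, samples: FloatArray): String

    // Releases the native context.
    external fun freeContext(ctx: Long)
}
```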
Prior art
Considerations
Whisper is a transformer model and as such relies on the CPU-heavy attention mechanism. The naive self-attention implementation has O(n²·d + n·d²) computational complexity and O(n²) memory usage. While modern inference implementations use optimized techniques that somewhat reduce this, inference still slows down significantly as the context grows.
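As a rough worked example (assuming Whisper tiny's model width of d = 384 and the full 30-second encoder window of n = 1500 positions), the two terms come out to roughly

```latex
n^2 d \approx 1500^2 \cdot 384 \approx 8.6 \times 10^{8}, \qquad
n d^2 \approx 1500 \cdot 384^2 \approx 2.2 \times 10^{8}
```

multiply-accumulates per self-attention layer, before accounting for the number of layers, the feed-forward blocks, and the decoder's autoregressive steps.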
While the large versions do manage to discern speech even in fairly extreme noise and across a wide range of accents, the smaller ones are more prone to certain failure modes, such as confusing homophones or hallucinating phrases, so choosing a model size is necessarily a balancing act. Real-time streaming seems quite far out of reach.
Add to that some encoding latency (which, UX-wise, may be the more critical part, given that decoding does support streaming), beam search, and other sources of delay, and it may become problematic on older devices.
If anyone reading this issue can suggest a performant, simple, and high-quality STT model, preferably of a more parallelizable architecture, please do share it in the replies.
Another possible mitigation is an option to use an external server for inference. Most modern CPUs can already complete both encoding and decoding in acceptable time, especially for shorter snippets. Using OpenAI's API schema would also give users a cheap managed hosting option, without having to configure their own server and expose it to the WAN via Tailscale, a private VPN, or port forwarding. Unless the user has no stable network connectivity, this option would likely be preferable, though a local fallback would be much appreciated.
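For illustration, here is a minimal Kotlin/OkHttp sketch of such a client, posting a recorded clip to any server that implements the OpenAI-style /v1/audio/transcriptions endpoint; the base URL, API key, and model name would be user-supplied settings rather than anything Sayboard currently defines:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import java.io.File

// Sends a recorded clip to an OpenAI-compatible transcription endpoint and
// returns the plain-text transcript.
fun transcribeRemotely(baseUrl: String, apiKey: String?, audio: File): String {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("model", "whisper-1") // or whatever model the server exposes
        .addFormDataPart("file", audio.name, audio.asRequestBody("audio/wav".toMediaType()))
        .addFormDataPart("response_format", "text")
        .build()

    val request = Request.Builder()
        .url("$baseUrl/v1/audio/transcriptions")
        .apply { if (apiKey != null) header("Authorization", "Bearer $apiKey") }
        .post(body)
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        check(response.isSuccessful) { "Transcription request failed: ${response.code}" }
        return response.body!!.string().trim()
    }
}
```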