An open-source subtitling platform 💻 for transcribing audio and video in Indic languages and translating the subtitles using ML models.
This project is participating in the Open Source AI Hackathon sponsored by Meta.
Theme: AI FOR IMAGE GENERATION/CREATIVES
There are almost no tools available for subtitling audio and video in Indian languages. Yet that shouldn't be the case, as there are now a lot of open-source models supporting speech transcription in most of the official Indian languages. This tool can be useful for subtitling audio and video, such as Indian cinema, for the media industry in general.
With the advent of new technologies like Meta's SeamlessM4T model and fine-tuned Whisper models, you can transcribe speech from source audio to text in the source language. With this, Hindi audio can be transcribed to Hindi text for generating subtitles. Meta's SeamlessM4T model also supports translation, so it can take Hindi audio and generate subtitles in languages like English, French, Malayalam, etc.
- Breaks language barriers, making content accessible to diverse audiences
- Empowers content creators with easy-to-use subtitling in multiple Indian languages
- Enhances viewer experience with accurate, timely subtitles
- Now content creators can create YouTube videos in their native language like Tamil and create captions in languages like English, Hindi, Malayalam, etc. with our tool.
- Can create educational content for doctors practising community medicine, and can be used in apps for schools. For example, content in English can be translated to Telugu, the mother tongue of a student, so they can understand things quickly.
- Can be used by media professionals to subtitle news content, movies, etc.
Our novel architecture, introduced with this project, for a Generative UI that works with any ASR model.
A. SeamlessM4T model
We are planning to use Meta's Seamless Communication technology, which was recently released on GitHub [1]. The SeamlessM4T_v2_large model 🚀 supports 12 Indic languages [2] by default. With this model alone, we can potentially transcribe audio in the respective languages and even translate subtitles into other languages. More details about SeamlessM4T can be found in the paper [7]. The functionality is very well explained in this tutorial [8] in the Seamless Communication repository.
For a lot of Indic languages, there are fine-tuned Whisper ASR models in the respective languages. More such models can be found in this Whisper event leaderboard [3]. We have personally fine-tuned Whisper models in our mother tongue, Malayalam, like [4] and [5]. So if the performance of any language is not good enough with the SeamlessM4T model, we can switch to one of the fine-tuned Whisper-based ASR models available in open source, or make one ourselves. One thing to note, though, is that Whisper might not support all the languages listed in Seamless.
Indic Languages supported with SeamlessM4T
Language | Code |
---|---|
Assamese | asm |
Bengali | ben |
English | eng |
Gujarati | guj |
Hindi | hin |
Kannada | kan |
Malayalam | mal |
Marathi | mar |
Odia | ory |
Punjabi | pan |
Tamil | tam |
Telugu | tel |
Urdu | urd |
The language code abbreviation for each of the models can be found here [6].
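As a rough sketch, speech-to-text transcription and translation with SeamlessM4T can look like the following, using the Translator API from the seamless_communication repository [1] (the audio path and language choices here are illustrative, and argument names follow the repo's README at the time of writing):

```python
import torch
from seamless_communication.inference import Translator

# Load SeamlessM4T v2; model and vocoder names follow the repo's README.
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)

# S2TT (speech-to-text): transcribe Hindi audio as Hindi text with tgt_lang="hin",
# or translate it by picking another code from the table above, e.g. "mal".
text_output, _ = translator.predict(
    input="hindi_clip.wav",
    task_str="S2TT",
    tgt_lang="hin",
)
print(text_output[0])
```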
B. faster-whisper
faster-whisper [9] is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models. This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU. Since faster-whisper is based on Whisper, it supports all 99 languages supported by Whisper.
Indic Languages supported with faster-whisper
Language | Code |
---|---|
Assamese | as |
Bengali | bn |
English | en |
Gujarati | gu |
Hindi | hi |
Kannada | kn |
Malayalam | ml |
Marathi | mr |
Punjabi | pa |
Tamil | ta |
Telugu | te |
Urdu | ur |
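A minimal faster-whisper sketch, following the usage shown in its README [9] (the file path and language choice are illustrative):

```python
from faster_whisper import WhisperModel

# large-v2 on GPU; compute_type="int8_float16" enables 8-bit quantization.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# language is optional; faster-whisper can also auto-detect it from the audio.
segments, info = model.transcribe("malayalam_clip.wav", beam_size=5, language="ml")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```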
C. WhisperX
WhisperX provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization. The features provided by WhisperX are:
- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
- 🪶 faster-whisper backend, requires <8GB gpu memory for large-v2 with beam_size=5
- 🎯 Accurate word-level timestamps using wav2vec2 alignment
- 👯♂️ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
- 🗣️ VAD preprocessing, reduces hallucination & batching with no WER degradation
Indic Languages supported with WhisperX
Language | Code |
---|---|
English | en |
Hindi | hi |
Telugu | te |
Urdu | ur |
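A short WhisperX sketch along the lines of its README: transcribe in batches, then align for word-level timestamps (the file path is illustrative):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("hindi_clip.wav")

# Batched transcription with the faster-whisper backend (the 70x realtime setup above).
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Align the output with wav2vec2 to get accurate word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(result["segments"])
```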
D. fine-tuned Whisper model
In certain languages, Whisper does not perform strongly by default. If, for your problem, the open-source Whisper model doesn't give good results, then fine-tune your own ASR model following guides like Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.
Indic Languages supported with fine-tuned Whisper model
Language | Code |
---|---|
Malayalam | ml |
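Once fine-tuned, such a model can be served like any other Whisper checkpoint. For instance, a sketch using the 🤗 Transformers ASR pipeline with our Malwhisper model [4] (the audio path is illustrative):

```python
from transformers import pipeline

# Load the fine-tuned Malayalam Whisper model from the Hugging Face Hub [4].
pipe = pipeline(
    "automatic-speech-recognition",
    model="kurianbenoy/Malwhisper-v1-medium",
    device=0,  # GPU index; use device=-1 for CPU
)

# chunk_length_s lets the pipeline window long audio into 30-second pieces.
result = pipe("malayalam_clip.wav", chunk_length_s=30)
print(result["text"])
```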
We plan to use FastAPI as the backend and deploy it on serverless platforms like Modal.com or other alternatives.
API format
- POST request for the web endpoints generate_seamlessm4t_speech, generate_faster_whisper_speech, and generate_whisperx_speech, with the following input format:
{
"wav_base64": "Audio in base64 format",
"target": "Your target lanugage you want to transcribe or translate your audio"
}
- POST request for the functions youtube_generate_seamlessm4t_speech, youtube_generate_faster_whisper_speech, and youtube_generate_whisperx_speech, with the following input format:
{
"yt_id": "Youtube ID as input in string format",
"target": "Your target lanugage you want to transcribe or translate your audio"
}
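A minimal FastAPI sketch of one such endpoint, matching the input format above (run_model is a hypothetical stand-in for the actual model inference, and the deployed Modal app may be structured differently):

```python
import base64
import tempfile

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SpeechRequest(BaseModel):
    wav_base64: str  # audio encoded as base64
    target: str      # target language code, e.g. "eng" or "hin"

def run_model(audio_path: str, tgt_lang: str):
    """Hypothetical stand-in for the SeamlessM4T inference sketched in section A."""
    raise NotImplementedError

@app.post("/generate_seamlessm4t_speech")
def generate_seamlessm4t_speech(req: SpeechRequest):
    # Decode the base64 payload to a temporary WAV file for the model.
    audio_bytes = base64.b64decode(req.wav_base64)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_bytes)
        audio_path = f.name
    segments = run_model(audio_path, req.target)
    return {"segments": segments}
```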
Next.js, being a React framework, offers you all the benefits of React plus more features out of the box, such as file-based routing and API routes, which can simplify your development process. It's an excellent choice, especially for a web application that requires server-side rendering (SSR) or static site generation (SSG) for better performance and SEO.
- Framework: Next.js (enables SSR and SSG, improving load times and SEO)
- Styling: Tailwind CSS or styled-components (for styling with ease and efficiency)
Week 1 🌛
- Create API to use Seamless M4T model
- Start building frontend audio/video upload workflow using Next.js
Week 2 🌓
- Build Landing page for Indic subtitler web app
- Build Dashboard to Upload Files, Generate & Edit subtitles and Download subtitles in .srt format
- Continue creating API to use the Seamless M4T v2 model. Seamless Communication by default doesn't support timestamps (see the GitHub issue); trying to find a good workaround for this.
GPUs needed: 1 A100 or T4
- Use Silero VAD to chunk the audio and use the start/end time of each chunk
We first run VAD through the entire audio to figure out each VAD chunk's start and end time, which are stored in an array. Then we loop through all these chunks and run the SeamlessM4T model on each of them.
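A sketch of that two-step loop using Silero VAD's get_speech_timestamps helper, loaded via torch.hub as in the silero-vad README (transcribe_chunk is a hypothetical stand-in for the SeamlessM4T call from section A):

```python
import torch

SAMPLE_RATE = 16000

# Load Silero VAD via torch.hub, following the silero-vad README.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

def transcribe_chunk(audio: torch.Tensor) -> str:
    """Hypothetical stand-in for the per-chunk SeamlessM4T call from section A."""
    raise NotImplementedError

wav = read_audio("input.wav", sampling_rate=SAMPLE_RATE)

# Step 1: run VAD over the entire audio to get speech chunk boundaries (in samples).
chunks = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)

# Step 2: loop over the chunks, transcribing each and keeping its start/end times.
segments = []
for chunk in chunks:
    text = transcribe_chunk(wav[chunk["start"]:chunk["end"]])
    segments.append({
        "start": chunk["start"] / SAMPLE_RATE,  # seconds
        "end": chunk["end"] / SAMPLE_RATE,
        "text": text,
    })
```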
Issues with this approach:
Smaller chunks get very little context, and because of this our model is sometimes not able to transcribe them accurately. We feel that for Seamless to work effectively, each chunk needs to be at least 5 seconds and less than 20 seconds long.
Audio length | Processing time |
---|---|
3 minutes | 41.4s |
5 minutes | 1m 42s |
15 minutes | 2m 23s |
27 minutes | 4m 45s |
- Completed integrating APIs with the Next.js frontend.
- Build API to handle audio/video part
Week 3 🌗
- Build Streaming API for Seamless M4T models
- Incorporate frontend to make use of streaming API endpoints for Generative UI
- On the landing page, include the LICENSE of the models; also add an About Us page.
- Add a Projects section to show uploaded audios and their associated SRT result files. Also show the name, created date, and file size (optional).
- Include more model families like faster-whisper, WhisperX, vegam-Malayalam-whisper, etc.
- Evaluate the performance of models in Indic subtitler on custom videos. (Made progress by adding ground truth to English audios)
- Improving the results of SeamlessM4T with GPT models.
- Grouping the chunks received from VAD into approximately 30-second-long chunks and then passing them to the Seamless model (the max cut-off for Seamless is 30 seconds); see the sketch after this list.
- See how we can then break down the longer, more accurate audio chunks into smaller parts with timestamps, again from the VAD array
- Try the WhisperX model on the whole audio, then compare with the smaller-chunks approach made with Seamless, and then try replacing the timestamped version with the audio from Seamless
- Consider breaking down the process into 2 independent steps:
- one for transcription only
- then a separate call to LLM to translate the accurate transcriptions to a target language
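The chunk grouping mentioned above could look like this greedy merge (group_vad_chunks is a hypothetical helper; the chunk format matches the Silero VAD output used earlier):

```python
SAMPLE_RATE = 16000

def group_vad_chunks(chunks, max_len_s=30.0, sample_rate=SAMPLE_RATE):
    """Greedily merge consecutive VAD chunks so each group stays under max_len_s."""
    groups = []
    current = None
    for chunk in chunks:
        start_s = chunk["start"] / sample_rate
        end_s = chunk["end"] / sample_rate
        if current is None:
            current = {"start": start_s, "end": end_s}
        elif end_s - current["start"] <= max_len_s:
            # Extending the running group gives the model more context per call.
            current["end"] = end_s
        else:
            groups.append(current)
            current = {"start": start_s, "end": end_s}
    if current is not None:
        groups.append(current)
    return groups
```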
Week 4 🌕
- Evaluate the performance of Indic subtitler on various languages
- Audio quality enhancement with Demucs
- Information page about the best set of models and when to use them.
- Live transcription (Aldrin will try and send a prototype soon)
- Quality enhancement (in parallel): using Demucs to remove background noise and improve the quality of the audio, thereby improving transcription accuracy; see the sketch after this list.
- Blog about model selection: rather than adding complex logic and restrictive conditions based on permutations in the UI, why don't we add a small hyperlink near the model dropdown to a new page (blog/article) where we just say: based on our testing, we found the following models give the best results: Seamless for x, WhisperX for y, etc. We can title the blog something like "Tips and tricks" or "How to get the best out of Indic Subtitler". Later we can also add benchmarks or graphs after doing evaluations; for now we just need a simple page with some text based on our observations. Nothing too restrictive or enforcing, just gentle suggestions based on which users could try switching models to get the best performance for their audio. This way, even if our suggestions don't give them the best results, it's still fine since these are mostly open-ended recommendations from our part. [DONE]
- Odia language integration (cool to talk about during the pitch, and mostly easy to implement as a new route since we already have access to a pre-trained model). In fact, we could even do this now since it would be very quick to start.
- GPT prompting (kept at the lowest priority because, if we get the quality part set using noise reduction, we won't even need GPT much since the accuracy would already be pretty good!)
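For the Demucs idea above, a possible invocation via its CLI (assuming the --two-stems and -o flags from recent Demucs releases; the vocals stem is what we would feed to the ASR model):

```python
import subprocess

# Separate vocals from background noise/music with Demucs.
# Outputs land under separated/<model_name>/<track_name>/{vocals,no_vocals}.wav.
subprocess.run(
    ["demucs", "--two-stems=vocals", "-o", "separated", "noisy_input.wav"],
    check=True,
)
```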
Week 5
- Use faster-whisper instead of Seamless by default (DONE)
- Adding Demucs as a priority
- Maybe consider adding some UI for the user while they wait
- Making video
- Making slides
- Add a pen icon or similar to make it clear that the edit feature exists (or some other cue to show that subtitles can be edited)
- Get 3-4 testimonials from people and add them to the landing page.
- Try adding a demo/example to the landing page.
- Maybe consider a small write-up about the live transcription
- Add the model recommendation page in the generate screen (https://indicsubtitler.vercel.app/blog/our-recommendations)
Week 6 onwards 🌕
- Fine-tune ASR models based on performance for the respective languages, and integrate more Whisper-based audio models.
- Build a desktop app similar to webapp for using all the functionalities
Demo Day
- Date: 12th April, 2024
- Venue: Hasura Office, Bangalore
- YouTube Video Link
- Presentation Link
- Instead of uploading, it would be good to have another option to pass YouTube video URLs directly and then do the subtitling. (Aravind)
- Improve the existing transcription accuracy by providing context along with the input audio, and then post-process with GPTs. (Simrat)
- We should ideally focus on doing one thing really well. We discussed two features with the mentors: the first is speech-to-text subtitling in the source language plus translation to other Indic languages; the second is generating speech output in a different language in a live-streaming-like setup. They said to build one thing really well and only then go to the next feature. (Bharat, Aravind)
- Add more ASR models, instead of SeamlessM4T only. (Bharat)
- Fine-tune ASR models if needed. (Bharat)
- [1] https://github.com/facebookresearch/seamless_communication
- [2] https://seamless.metademolab.com/source_languages
- [3] https://huggingface.co/spaces/whisper-event/leaderboard
- [4] https://huggingface.co/kurianbenoy/Malwhisper-v1-medium
- [5] https://huggingface.co/collections/kurianbenoy/vegam-whisper-models-65132456b4a3c844a7bf8d8e
- [6] https://github.com/facebookresearch/seamless_communication/blob/main/demo/expressive/utils.py#L2-L103
- [7] Seamless M4T paper - https://arxiv.org/abs/2308.11596
- [8] https://github.com/facebookresearch/seamless_communication/blob/main/Seamless_Tutorial.ipynb
- [9] https://github.com/SYSTRAN/faster-whisper