
Being able to recognize a subset of spoken Hindi words through offline models #285

GautamR-Samagra opened this issue Jan 20, 2024 · 7 comments

GautamR-Samagra commented Jan 20, 2024

Task:
Create an offline alternative to Google's Read Along app in Hindi. It should be able to show a set of words and determine whether each word has been spoken correctly.

We have taken two approaches to this:

  • Determine the words spoken via an offline transcription model and check whether the transcript matches the 'actual word' (i.e. the word supposed to be read). A minimal sketch of the matching step follows this list.
  • Maintain vector representations of the required set of words and match the recorded words against them (no work on this yet - just spitballing).
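
For the first approach, the match probably shouldn't require exact equality, since small transcription errors are expected. A minimal, model-agnostic sketch using only the Python standard library (the 0.8 threshold is an untuned placeholder, not a validated value):

```python
# Approach 1 matching step: accept a transcribed word if it is "close
# enough" to the target word, tolerating small transcription errors.
from difflib import SequenceMatcher

def is_correct(transcribed: str, target: str, threshold: float = 0.8) -> bool:
    """Return True if the transcribed word is close enough to the target."""
    ratio = SequenceMatcher(None, transcribed.strip(), target.strip()).ratio()
    return ratio >= threshold

print(is_correct("किताब", "किताब"))  # True: exact match
print(is_correct("कितब", "किताब"))   # True: one missing matra (ratio ≈ 0.89)
print(is_correct("पानी", "किताब"))   # False: different word
```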

Checking offline transcription models - Vosk:

  • Data:
    We collected a small amount of data in the field (about 20 minutes of children speaking the required words), compiled here -
    S2T PoC Content.xlsx

  • Actual transcription of the audio data:
    The transcripts of the audio will not match the 'paragraph' being read, since students often repeat words several times to get them right. Hence, I have transcribed all the audio using Conformer (Bhashini) to get better-quality reference transcripts.
    The collated transcripts are here - base64_and_transcripts.xlsx

  • Looking at small transcription models:
    We tested Vosk - the smallest reasonably accurate transcription model we could find - and have collated its accuracy here.
    A Colab that runs Vosk on WAV files and computes accuracy is here; a minimal usage sketch also follows this list.

  • Vosk accuracy and next steps:
    We found that Vosk struggles to recognize words correctly when the recorded audio is too noisy (which is often the case for us). Hence there is a track on fine-tuning Vosk with our data to see whether that improves transcription.
    Fine-tuning of Vosk models is covered upstream in this ticket.

  • Other approaches:
    We also took a crack at whisper tiny for the same task, with the goal of quantizing it later for mobile use. Whisper tiny is ~150 MB, and we would ideally like our model to be around ~50 MB.
    However, whisper tiny didn't recognize Hindi, and other Hindi whisper tiny fine-tunes gave very poor transcriptions. This is done here; a hedged loading sketch also follows this list.
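
For reference, a minimal sketch of running Vosk offline on a single WAV file, assuming the small Hindi model vosk-model-small-hi-0.22 has been downloaded and unzipped locally; the file names are placeholders and the input must be 16-bit mono PCM:

```python
# Minimal offline Vosk transcription of one WAV file (16-bit mono PCM).
import json
import wave

from vosk import KaldiRecognizer, Model

model = Model("vosk-model-small-hi-0.22")  # local path to the Hindi model

wf = wave.open("test.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio to the recognizer in chunks.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

# FinalResult() returns a JSON string; the transcript is under "text".
transcript = json.loads(rec.FinalResult())["text"]
print(transcript)
```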
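
And a hedged sketch of how whisper tiny can be exercised through the Hugging Face ASR pipeline; the language override via generate_kwargs is an assumption about recent transformers versions, and test.wav is a placeholder:

```python
# Illustrative whisper-tiny run via the Hugging Face ASR pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Force Hindi decoding; without this, whisper guesses the language.
result = asr(
    "test.wav",
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)
print(result["text"])
```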

Figuring out vector representations of the required words:

This is something we haven't tried yet. The idea is that the sheet shared earlier already gives us the list of Hindi words we need to match the recordings against. So if we use some acoustic word embedding model to encode those words in vector form and then match them directly against encoded recordings, that should be good enough for our use case. You can contribute details on what the next steps should be here. A sketch of the matching step follows.
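
As a sketch of that matching step (the embedding model is hypothetical - random vectors stand in for real acoustic word embeddings so the example runs end to end):

```python
# Match one encoded recording against encoded target words by cosine
# similarity. Real acoustic word embeddings would replace the random
# stand-in vectors below; the 0.7 threshold is an untuned placeholder.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_recording(recording_vec, word_vecs, threshold=0.7):
    """Return the closest target word, or None if nothing is similar enough."""
    best_word, best_score = None, -1.0
    for word, vec in word_vecs.items():
        score = cosine_similarity(recording_vec, vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_score >= threshold else None

# Stand-ins for embeddings of the target word list and of one recording.
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=64) for w in ["किताब", "पानी", "स्कूल"]}
recording_vec = word_vecs["पानी"] + 0.1 * rng.normal(size=64)  # noisy copy
print(match_recording(recording_vec, word_vecs))  # expected: पानी
```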

@xorsuyash

Hey @GautamR-Samagra, looking forward to collaborating and contributing to this project. Can you please assign it to me?


xorsuyash commented Jan 20, 2024

@GautamR-Samagra please give me access to the sheets linked above.

GautamR-Samagra commented Jan 21, 2024

> Hey @GautamR-Samagra, looking forward to collaborating and contributing to this project. Can you please assign it to me?

Thanks for trying to contribute :)
I don't want to assign it yet; do raise a PR once you are able to contribute and I'll assign it to you.

xorsuyash commented Jan 21, 2024

@GautamR-Samagra can I get access to the audio samples? They would be very helpful for trying out Vosk fine-tuning and also for trying the vector-embedding approach using speech2vec models.

@GautamR-Samagra

> @GautamR-Samagra can I get access to the audio samples? They would be very helpful for trying out Vosk fine-tuning and also for trying the vector-embedding approach using speech2vec models.

I have given you access to the sheet. I have also collated the audio files separately in a folder here.

@GautamR-Samagra

@xorsuyash thanks for pointing out that speech2vec doesn't make sense as an embedding approach here, since it is ultimately trained on semantics. Would an 'acoustic word embedding' model make more sense (like this)?

xorsuyash commented Jan 26, 2024

@GautamR-Samagra Acoustic word embeddings will help us cluster the same word spoken by different speakers; I am still trying to figure out a way to fine-tune Vosk. One way acoustic word embeddings can help is by clustering the words spoken by different speakers, estimating the spoken word via a similarity measure, and then using that estimate to predict the word.
https://colab.research.google.com/drive/1sWgS9JBsaqf7q_936PkTKSrnHLZfWNiS?usp=sharing
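
A rough sketch of that clustering idea, assuming each utterance has already been embedded (random vectors stand in for real embeddings; KMeans on L2-normalized vectors approximates cosine clustering):

```python
# Cluster per-utterance acoustic word embeddings so that the same word
# spoken by different speakers lands in the same cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 128))  # stand-in for real AWE vectors

# L2-normalize so Euclidean k-means behaves like cosine clustering.
X = normalize(embeddings)

n_target_words = 10  # size of the target word list
kmeans = KMeans(n_clusters=n_target_words, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Each cluster should mostly contain one word; comparing its centroid
# against embeddings of the known word list then predicts that word.
print(labels[:10])
```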
