
[DMP 2024]: Create offline audio-phonetic matching model #313

Open
GautamR-Samagra opened this issue Apr 19, 2024 · 19 comments

@GautamR-Samagra
Collaborator

GautamR-Samagra commented Apr 19, 2024

Offline Alternative to Google's Read Along App in Hindi

Description

Develop an offline application (POC - web) that can display a set of Hindi words and accurately determine if the user has pronounced each word correctly. The app aims to be an educational tool for Hindi language learners, providing instant feedback on their pronunciation.

The application is envisioned as an offline tool similar to Google's Read Along app but specifically for the Hindi language. It should present users with Hindi words and listen to the user's attempt to pronounce these words, providing feedback on the accuracy of their pronunciation.

Approaches for Consideration:

  • Vector Representation of Words: Explore the possibility of maintaining vector representations of the required set of Hindi words. These vectors will be used to match against the vector-encoded recordings of spoken words by the user.
  • Acoustic Word Encodings: Utilize acoustic word encodings to convert the list of Hindi words into vector form. This encoding will then be matched against the encoded recordings from users to determine the accuracy of pronunciation (a rough sketch of this matching step follows the list).
  • Feedback Mechanism: Implement a feedback system that informs users of the correctness of their pronunciation and offers suggestions or corrections as needed.
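
A minimal sketch of the vector-matching step described above, assuming an acoustic encoder already exists that maps both reference words and user recordings into a shared embedding space. The embedding inputs and the 0.8 threshold are placeholder assumptions, not part of the original proposal:

```python
# Sketch of the vector-matching idea: compare a recording's embedding
# against precomputed embeddings of the reference Hindi words.
# How the embeddings are produced is left to the acoustic model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_pronunciation(recording_vec: np.ndarray,
                        reference_vecs: dict,
                        threshold: float = 0.8):
    """Return the closest reference word and whether it clears the threshold."""
    best_word, best_sim = None, -1.0
    for word, vec in reference_vecs.items():
        sim = cosine_similarity(recording_vec, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim >= threshold
```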

Implementation Details:

  • The project requires the creation of a robust and efficient algorithm for converting Hindi words and spoken recordings into vector representations that can be accurately compared.
  • The app should be capable of running offline, which requires all data and models to be stored locally on the device.
  • User interface design should be intuitive, encouraging users to engage with the app and improve their Hindi pronunciation skills.
  • Consideration should be given to privacy and data security, especially concerning user recordings.

This is an open invitation for contributors to suggest ideas, approaches, and potential technologies that could be utilized to achieve the project goals. Contributions at all stages of development are welcome, from conceptualization to implementation.

Goals & Mid-Point Milestone

  • A small-footprint repo that can infer whether a .wav file contains any of a predefined set of words (around 2,000)

Sample audio files:

Acceptance Criteria

A lite model that can detect the subset of words that a child has pronounced correctly.

Mockups/Wireframes

Product Name

Nipun Lakshya App

Organisation Name

SamagraX

Domain

Education

Tech Skills Needed

Machine Learning, Natural Language Processing, Python

Mentor(s)

@GautamR-Samagra

Category

Machine Learning

@ChakshuGautam ChakshuGautam changed the title Create offline audio-phonetic matching model [DMP 2024]: Create offline audio-phonetic matching model Apr 20, 2024
@Azazel0203

Hello @ChakshuGautam,

What format will the displayed Hindi words take? For example:

  1. random words (word by word, making the user pronounce one before moving on to the next)
  2. paragraphs, e.g. in story form, where the user reads the paragraph and the model scores it.

Also, is there a specific corpus of Hindi text to be used?

@GautamR-Samagra
Collaborator Author


The 2nd one: a paragraph that a child can read. Ideally the UI would show around 2 sentences at a time, with the paragraph scrolling down as the child keeps reading, until it is fully read.

I have added a sample dataset.

@GautamR-Samagra
Collaborator Author


Because this is meant to check whether a person has read correctly, the model needs to be based more on the phonetics of the audio than on auto-regressively decoding the next word.
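
As a crude illustration of this phonetics-first framing: since Devanagari orthography is close to phonetic, an edit distance between the expected word and a recognised hypothesis (from any acoustic-only decoder) is a cheap proxy for pronunciation correctness that does not rely on a language model guessing the next word. The tolerance ratio below is an assumption:

```python
# Compare the expected word to what was heard, character by character.
# Devanagari spelling tracks pronunciation closely enough that a small
# edit distance suggests the word was pronounced roughly correctly.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_close_pronunciation(expected: str, heard: str,
                           max_ratio: float = 0.3) -> bool:
    """Accept if the edit distance is a small fraction of the word length."""
    return levenshtein(expected, heard) <= max_ratio * max(len(expected), 1)
```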

@horoshiny

I would like to work on this project

@AbhimanyuSamagra

Do not ask process-related questions (how to apply, who to contact) in this ticket; only questions about the technical aspects of the project itself are allowed. If you want help with the process, refer to the instructions listed on Unstop; any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.

@Azazel0203

Hello @GautamR-Samagra,

I've delved into this use case and found some pre-trained models that yield promising results with only a small amount of fine-tuning. Although I had limited resources on the free tier of Colab, I managed to achieve notable improvements.

Attached Image: actual_output vs output_generated

As depicted in the image, there's still some difference between the actual output and the output generated by the model.

My approach involves recording a .wav file and converting it into words, then comparing those words against a repository of pre-stored correct words and sentences to derive a score. This initial evaluation phase sets the stage for fine-tuning the model to suit our specific needs.
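
For reference, the scoring step of this approach could look roughly like the following, with `difflib` aligning the ASR hypothesis against the stored reference. The transcription model itself is out of scope here; `reference` and `hypothesis` are assumed inputs:

```python
# Align the recognised words against the reference sentence and report
# the fraction of reference words that were matched in order.
import difflib

def word_score(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = difflib.SequenceMatcher(None, ref_words, hyp_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(ref_words), 1)
```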

I would greatly appreciate any feedback or suggestions you may have on refining this approach.

Thank you.

@RohanHBTU

hello @GautamR-Samagra @ChakshuGautam ,

I have been working on the project, where we are supposed to implement a read-along app that runs offline.
I started by using Mel-frequency cepstral coefficients (MFCCs) to measure similarity between speech samples and score them, since MFCCs are computationally efficient.


Since MFCCs may not capture all aspects of pronunciation, and also give a high similarity score even for incomplete speech, I am currently tinkering with x-vectors.
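
For context, an MFCC comparison of the kind described above is commonly done with dynamic time warping; a rough sketch with librosa follows. The 13-coefficient setting and the path-length normalisation are assumptions:

```python
# Compare two recordings of the same word via MFCC features and DTW.
# A smaller normalised accumulated cost means more similar utterances.
import librosa

def mfcc_dtw_distance(path_a: str, path_b: str, sr: int = 16000) -> float:
    y_a, _ = librosa.load(path_a, sr=sr)
    y_b, _ = librosa.load(path_b, sr=sr)
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=13)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=13)
    # dtw returns the accumulated cost matrix and the warping path
    D, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    # normalise by path length so longer clips are not penalised
    return float(D[-1, -1] / len(wp))
```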

Please reply with your valuable feedback. Thank you for your time and consideration.

@RohanHBTU

hi @GautamR-Samagra @ChakshuGautam,

After tinkering with x-vectors, I got the following results.


It was time-consuming and demanded heavy computation (not suitable for edge devices). In addition, it wasn't able to solve the existing problem with MFCCs. So I will try to set up a workaround to tackle the above issue, and I will keep you posted.
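
For reference, one common way to extract x-vectors is SpeechBrain's pretrained VoxCeleb model; note that these embeddings are trained for speaker identity rather than word content, which is consistent with the problem reported above. A sketch, assuming 16 kHz mono input files:

```python
# Extract x-vectors with SpeechBrain's pretrained speaker embedding model
# and compare two recordings by cosine similarity.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_xvect",
)

def xvector(path: str) -> torch.Tensor:
    signal, sr = torchaudio.load(path)  # expects 16 kHz mono audio
    return classifier.encode_batch(signal).squeeze()

sim = torch.nn.functional.cosine_similarity(
    xvector("word_a.wav"), xvector("word_b.wav"), dim=0
)
```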

@RohanHBTU

hi @GautamR-Samagra @ChakshuGautam,

I tried setting up the prototype the other way around and here are the results.

read.along.demo.mp4

This setup works in an offline environment. The score is not perfect because the selected sentence contains inconsistent spacing. The model is based on Whisper (OpenAI) and is quite large for an edge device, so I will try to reduce its size (the time taken to predict the score is due to Gradio's framework, not the model). Any feedback would be valuable for the development of the project. Thank you for your time.
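
For reference, the offline Whisper loop described above can be sketched with the openai-whisper package; the choice of the `tiny` checkpoint here is only an assumption to keep the footprint small (the comment does not say which size was used):

```python
# Transcribe a Hindi recording offline with a local Whisper checkpoint.
import whisper

model = whisper.load_model("tiny")  # smallest checkpoint, ~39M parameters

def transcribe_hi(path: str) -> str:
    result = model.transcribe(path, language="hi")
    return result["text"].strip()
```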

@GautamR-Samagra
Collaborator Author

@RohanHBTU what did you use to create the x-vectors? And can you mention which Whisper model you used for the last comment?

@RohanHBTU

hi @GautamR-Samagra @ChakshuGautam ,

The Whisper model was too big for an edge device in an offline environment, even after quantization. So I tried another model that is lightweight and low-latency.

vosk_demo.mp4

The model is only 42 MB zipped and 78 MB after extraction.
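
That size is consistent with Vosk's small Hindi model. A minimal offline recognition sketch, assuming `vosk-model-small-hi-0.22` has been unpacked next to the script and the input is 16-bit mono PCM:

```python
# Offline speech recognition with Vosk: feed audio frames to the
# recogniser in chunks, then read the final transcript as JSON.
import json
import wave
from vosk import KaldiRecognizer, Model

model = Model("vosk-model-small-hi-0.22")

def transcribe(path: str) -> str:
    wf = wave.open(path, "rb")  # 16 kHz, 16-bit mono PCM expected
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)
    return json.loads(rec.FinalResult())["text"]
```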

@Ashutosh-Gera

Hi @GautamR-Samagra, I wish to work on this project as part of the C4GT program. I am a pre-final-year student at IIIT Delhi, India, and I believe I will be able to contribute positively to the project. Since I only recently got to know about this program and the deadline is approaching, could you please give me clarity on what steps I should take to demonstrate my dedication and make my proposal strong?

Furthermore, it'd be great if I could get your Discord so that I can work directly under your supervision.

Awaiting your reply.

thank you

@GautamR-Samagra
Collaborator Author

  1. Create an alignment model that takes audio-transcript combinations of any length as input and outputs a list of audio-transcript combinations of any set audio length / word length.
    Looking at other forced-alignment tools here (a rough alignment sketch follows this list):
  • Wav2Vec2 --- (1)
  • aeneas *
  2. Collate the IndicSUPERB and NL-app datasets (create transcripts using wav2vec + any other tool) --- (2)

  3. Finalise dataset requirements for the audio-acoustic model: word audio + word pairs -- (3)

  4. Convert 2 into word pairs using 1 (or any format required by the acoustic embedding model). -- (4)

  5. Check out tiny denoisers (ideally less than 20 MB) like https://huggingface.co/qualcomm/Facebook-Denoiser/tree/main

  6. Train on a mixture of the IndicSUPERB and NL datasets with acoustic embeddings based on this -- (5)

  7. Create a test split from both the steno results (measuring ORF) and the dataset created above. -- (6)

  8. Iterate on improving accuracy.

  9. Acoustic model implementations:

  • train on Metaphone phonetic conversions instead of the transcripts directly, as shown here
  10. Word detection for audios: solving for the student pausing while speaking a word.

  11. Model experiments:

  • fine-tune and quantize Whisper and measure ORF (oral reading frequency) by setting a cutoff on token probabilities; be able to get token probabilities for stream-like Whisper and carry out the above
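
A rough sketch of the alignment step labelled (1), using torchaudio's CTC `forced_align` helper (available in recent torchaudio releases). The English Wav2Vec2 checkpoint is a stand-in; a Hindi model such as IndicWav2Vec would be swapped in for real use:

```python
# CTC forced alignment with a Wav2Vec2 ASR checkpoint: produces per-frame
# token assignments, from which word-level audio segments can be cut out.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr_model = bundle.get_model()
labels = bundle.get_labels()                 # blank '-' is index 0, '|' = space
dictionary = {c: i for i, c in enumerate(labels)}

def align(wav_path: str, transcript: str):
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        emissions, _ = asr_model(waveform)
        log_probs = torch.log_softmax(emissions, dim=-1)
    tokens = torch.tensor(
        [[dictionary[c] for c in transcript.upper().replace(" ", "|")]]
    )
    # alignment[i] = token id emitted at frame i; scores are frame log-probs
    alignment, scores = torchaudio.functional.forced_align(
        log_probs, tokens, blank=0
    )
    return alignment, scores
```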

@GautamR-Samagra
Collaborator Author

@xorsuyash can you comment here so that I can assign this to you?

@xorsuyash
Collaborator

xorsuyash commented Jun 4, 2024

@GautamR-Samagra

@xorsuyash
Collaborator

xorsuyash commented Jun 11, 2024

cc @GautamR-Samagra

Training an acoustic word embedding model to optimize audio-transcript matching

  • Dataset Preparation

    For training the acoustic word embedding model we need words and their corresponding audio pronunciations. For this we can leverage word-by-word forced alignment of the large amount of publicly available ASR data that contains speech with its transcription.
    We use the Viterbi algorithm with backtracking, which finds the most probable path of characters through the audio frames.

    Being able to segment audio word by word here
    Integrated as a service in autotune here

    Using IndicWav2Vec2 and open-mms as phoneme models for generating logits for the audios.

  • Model Training

    For model training we need audio-transcript pairs, which will be the output of our forced-alignment pipeline.
    The format of the audio-transcript data we are going to use for model training is here

    The initial approach is to use a Bi-LSTM layer to map audio and transcripts into a shared latent vector space, trained with objective losses that pull acoustically similar words close together, and pull each audio clip close to its own transcript, so that the embeddings can then be used to match audio to its correct orthographic segment. (A skeletal training sketch follows the references below.)

Model architecture is here

  • TODO

    • Generating high quality word transcript pair.
    • Figuring out ways to sample hard negatives for objective losses.
    • Designing tasks for target metrics like Average Precision for the model to optimize on the dev and test sets.
  • References

Multilingual jointly trained acoustic word embedding model

DMP proposal
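
A skeletal sketch of the Bi-LSTM encoder and triplet-style objective described above; all dimensions, the margin, and the last-step pooling are assumptions for illustration, not the final architecture:

```python
# Bi-LSTM acoustic word encoder: pools variable-length MFCC sequences
# into a fixed, L2-normalised vector. A triplet loss pulls two utterances
# of the same word together and pushes a different word away.
import torch
import torch.nn as nn

class AcousticWordEncoder(nn.Module):
    def __init__(self, n_mfcc: int = 39, hidden: int = 256, dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_mfcc) -> (batch, dim)
        out, _ = self.lstm(x)
        emb = self.proj(out[:, -1, :])  # last time step as sequence summary
        return nn.functional.normalize(emb, dim=-1)

encoder = AcousticWordEncoder()
triplet = nn.TripletMarginLoss(margin=0.4)
# anchor/positive: two utterances of one word; negative: a different word
a, p, n = (torch.randn(8, 120, 39) for _ in range(3))
loss = triplet(encoder(a), encoder(p), encoder(n))
loss.backward()
```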

@GautamR-Samagra
Collaborator Author

@prabakaranc98 here

@GautamR-Samagra
Collaborator Author

GautamR-Samagra commented Aug 5, 2024

Weekly Goals

Week 1

  • Force Alignment using Wav2Vec

Week 2

  • Force Alignment using Wav2Vec - Adding to Autotune

Week 3

  • Force Alignment using WhisperX

Week 4

  • Creating a dataset for the audio phonetic model

Week 5

  • Implementation of training loop based on the paper

Week 6

  • First model iteration of the audio phonetic model

Week 7

  • Fine tuning parameters and performance improvement

Week 8

  • Running inference on streaming/long audio data
  • Fine tuning parameters and performance improvement

Week 9

  • Combining audio phonetic model with inference
  • Fine tuning parameters and performance improvement

Week 10

  • Comparing against current model for ORF setup
  • Fine tuning parameters and performance improvement

Week 11

  • Improving setup and documentation

Week 12

  • Improving setup and documentation

@xorsuyash
Collaborator

xorsuyash commented Aug 9, 2024

Weekly Goals

Week 1

Week 2

Week 3

Week 4

  • Creating a dataset for the audio phonetic model

Week 5

Week 6

  • First model iteration of the audio phonetic model
    Discussion

Week 7

  • Fine tuning parameters and performance improvement

Week 8

  • Running inference on streaming/long audio data
  • Fine tuning parameters and performance improvement
    Inference

Week 9

  • Combining audio phonetic model with inference
  • Fine tuning parameters and performance improvement

Week 10

  • Comparing against current model for ORF setup
  • Fine tuning parameters and performance improvement

Week 11

  • Improving setup and documentation

Week 12

  • Improving setup and documentation
