
[DMP 2024]: Create offline audio-phonetic matching model #313

Open
GautamR-Samagra opened this issue Apr 19, 2024 · 19 comments

@GautamR-Samagra
Collaborator

GautamR-Samagra commented Apr 19, 2024

Offline Alternative to Google's Read Along App in Hindi

Description

Develop an offline application (POC - web) that can display a set of Hindi words and accurately determine if the user has pronounced each word correctly. The app aims to be an educational tool for Hindi language learners, providing instant feedback on their pronunciation.

The application is envisioned as an offline tool similar to Google's Read Along app but specifically for the Hindi language. It should present users with Hindi words and listen to the user's attempt to pronounce these words, providing feedback on the accuracy of their pronunciation.

Approaches for Consideration:

  • Vector Representation of Words: Explore the possibility of maintaining vector representations of the required set of Hindi words. These vectors will be used to match against the vector-encoded recordings of spoken words by the user.
  • Acoustic Word Encodings: Utilize acoustic word encodings to convert the list of Hindi words into vector form. This encoding will then be matched against the encoded recordings from users to determine the accuracy of pronunciation (a rough sketch of this matching step follows the list).
  • Feedback Mechanism: Implement a feedback system that informs users of the correctness of their pronunciation and offers suggestions or corrections as needed.
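
A minimal sketch of the vector-matching step described above, assuming an acoustic encoder already exists that maps both reference words and user recordings into a shared embedding space. The embedding inputs and the 0.8 threshold are placeholder assumptions, not part of the original proposal:

```python
# Sketch of the vector-matching idea: compare a recording's embedding
# against precomputed embeddings of the reference Hindi words.
# How the embeddings are produced is left to the acoustic model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_pronunciation(recording_vec: np.ndarray,
                        reference_vecs: dict,
                        threshold: float = 0.8):
    """Return the closest reference word and whether it clears the threshold."""
    best_word, best_sim = None, -1.0
    for word, vec in reference_vecs.items():
        sim = cosine_similarity(recording_vec, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim >= threshold
```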

Implementation Details:

  • The project requires the creation of a robust and efficient algorithm for converting Hindi words and spoken recordings into vector representations that can be accurately compared.
  • The app should be capable of running offline, which requires all data and models to be stored locally on the device.
  • User interface design should be intuitive, encouraging users to engage with the app and improve their Hindi pronunciation skills.
  • Consideration should be given to privacy and data security, especially concerning user recordings.

This is an open invitation for contributors to suggest ideas, approaches, and potential technologies that could be utilized to achieve the project goals. Contributions at all stages of development are welcome, from conceptualization to implementation.

Goals & Mid-Point Milestone

  • A small-footprint repo that can infer whether a .wav file contains any of a predefined set of words (around 2,000)

Sample audio files:

Acceptance Criteria

A lite model that can detect the subset of words that a child has pronounced correctly.

Mockups/Wireframes

Product Name

Nipun Lakshya App

Organisation Name

SamagraX

Domain

Education

Tech Skills Needed

Machine Learning, Natural Language Processing, Python

Mentor(s)

@GautamR-Samagra

Category

Machine Learning

@ChakshuGautam ChakshuGautam changed the title Create offline audio-phonetic matching model [DMP 2024]: Create offline audio-phonetic matching model Apr 20, 2024
@Azazel0203

Hello @ChakshuGautam,

What format will the displayed Hindi words take? For example:

  1. random words (word by word, making the user pronounce one before moving on to the next)
  2. paragraphs, e.g. in story form, where the user reads the paragraph and the model scores it.

Also, is there a specific corpus of Hindi text to be used?

@GautamR-Samagra
Collaborator Author


The 2nd one: a paragraph that a child can read. Ideally the UI would show around 2 sentences at a time, with the paragraph scrolling down as the child keeps reading, until it is fully read.

I have added a sample dataset.

@GautamR-Samagra
Collaborator Author


Because this is meant to check whether a person has read correctly, the model needs to be based more on the phonetics of the audio than on auto-regressively decoding the next word.
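
As a crude illustration of this phonetics-first framing: since Devanagari orthography is close to phonetic, an edit distance between the expected word and a recognised hypothesis (from any acoustic-only decoder) is a cheap proxy for pronunciation correctness that does not rely on a language model guessing the next word. The tolerance ratio below is an assumption:

```python
# Compare the expected word to what was heard, character by character.
# Devanagari spelling tracks pronunciation closely enough that a small
# edit distance suggests the word was pronounced roughly correctly.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_close_pronunciation(expected: str, heard: str,
                           max_ratio: float = 0.3) -> bool:
    """Accept if the edit distance is a small fraction of the word length."""
    return levenshtein(expected, heard) <= max_ratio * max(len(expected), 1)
```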

@horoshiny

I would like to work on this project

@AbhimanyuSamagra

Do not ask process-related questions (how to apply, who to contact) in this ticket; only questions about the technical aspects of the project itself are allowed. If you want help with the process, refer to the instructions listed on Unstop; any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.

@Azazel0203

Hello @GautamR-Samagra,

I've delved into this use case and found some pre-trained models that yield promising results with only a small amount of fine-tuning. Although I had limited resources on the free tier of Colab, I managed to achieve notable improvements.

Attached Image: actual_output vs output_generated

As depicted in the image, there's still some difference between the actual output and the output generated by the model.

My approach involves recording a .wav file and converting it into words, then comparing those words against a repository of pre-stored correct words and sentences to derive a score. This initial evaluation phase sets the stage for fine-tuning the model to suit our specific needs.
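
For reference, the scoring step of this approach could look roughly like the following, with `difflib` aligning the ASR hypothesis against the stored reference. The transcription model itself is out of scope here; `reference` and `hypothesis` are assumed inputs:

```python
# Align the recognised words against the reference sentence and report
# the fraction of reference words that were matched in order.
import difflib

def word_score(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = difflib.SequenceMatcher(None, ref_words, hyp_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(ref_words), 1)
```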

I would greatly appreciate any feedback or suggestions you may have on refining this approach.

Thank you.

@RohanHBTU

hello @GautamR-Samagra @ChakshuGautam ,

I have been working on the project, where we are supposed to implement a read-along app that runs offline.
I started by using Mel-frequency cepstral coefficients (MFCCs) to measure similarity between speech samples and score them, since MFCCs are computationally efficient.


Since MFCCs may not capture all aspects of pronunciation, and also give a high similarity score even for incomplete speech, I am currently tinkering with x-vectors.
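
For context, an MFCC comparison of the kind described above is commonly done with dynamic time warping; a rough sketch with librosa follows. The 13-coefficient setting and the path-length normalisation are assumptions:

```python
# Compare two recordings of the same word via MFCC features and DTW.
# A smaller normalised accumulated cost means more similar utterances.
import librosa

def mfcc_dtw_distance(path_a: str, path_b: str, sr: int = 16000) -> float:
    y_a, _ = librosa.load(path_a, sr=sr)
    y_b, _ = librosa.load(path_b, sr=sr)
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=13)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=13)
    # dtw returns the accumulated cost matrix and the warping path
    D, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="euclidean")
    # normalise by path length so longer clips are not penalised
    return float(D[-1, -1] / len(wp))
```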

Please reply with your valuable feedback. Thank you for your time and consideration.

@RohanHBTU

hi @GautamR-Samagra @ChakshuGautam,

After tinkering with x-vectors, I got the following results.


It was time-consuming and demanded heavy computation (not suitable for edge devices). In addition, it wasn't able to solve the existing problem with MFCCs. So I will try to set up a workaround to tackle the above issue, and I will keep you posted.
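
For reference, one common way to extract x-vectors is SpeechBrain's pretrained VoxCeleb model; note that these embeddings are trained for speaker identity rather than word content, which is consistent with the problem reported above. A sketch, assuming 16 kHz mono input files:

```python
# Extract x-vectors with SpeechBrain's pretrained speaker embedding model
# and compare two recordings by cosine similarity.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_xvect",
)

def xvector(path: str) -> torch.Tensor:
    signal, sr = torchaudio.load(path)  # expects 16 kHz mono audio
    return classifier.encode_batch(signal).squeeze()

sim = torch.nn.functional.cosine_similarity(
    xvector("word_a.wav"), xvector("word_b.wav"), dim=0
)
```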

@RohanHBTU

hi @GautamR-Samagra @ChakshuGautam,

I tried setting up the prototype the other way around and here are the results.

read.along.demo.mp4

This setup works in an offline environment. The score is not perfect because the selected sentence contains inconsistent spacing. The model is based on Whisper (OpenAI) and is quite large for an edge device, so I will try to reduce its size (the time taken to predict the score is due to Gradio's framework, not the model). Any feedback would be valuable for the development of the project. Thank you for your time.
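
For reference, the offline Whisper loop described above can be sketched with the openai-whisper package; the choice of the `tiny` checkpoint here is only an assumption to keep the footprint small (the comment does not say which size was used):

```python
# Transcribe a Hindi recording offline with a local Whisper checkpoint.
import whisper

model = whisper.load_model("tiny")  # smallest checkpoint, ~39M parameters

def transcribe_hi(path: str) -> str:
    result = model.transcribe(path, language="hi")
    return result["text"].strip()
```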

@GautamR-Samagra
Collaborator Author

@RohanHBTU what did you use to create the x-vectors? And can you mention which Whisper model you used for the last comment?

@RohanHBTU

hi @GautamR-Samagra @ChakshuGautam ,

The Whisper model was too big for an edge device in an offline environment, even after quantization. So I tried another model that is lightweight and low-latency.

vosk_demo.mp4

The model is only 42 MB zipped and 78 MB after extraction.
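
That size is consistent with Vosk's small Hindi model. A minimal offline recognition sketch, assuming `vosk-model-small-hi-0.22` has been unpacked next to the script and the input is 16-bit mono PCM:

```python
# Offline speech recognition with Vosk: feed audio frames to the
# recogniser in chunks, then read the final transcript as JSON.
import json
import wave
from vosk import KaldiRecognizer, Model

model = Model("vosk-model-small-hi-0.22")

def transcribe(path: str) -> str:
    wf = wave.open(path, "rb")  # 16 kHz, 16-bit mono PCM expected
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)
    return json.loads(rec.FinalResult())["text"]
```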

@Ashutosh-Gera

Hi @GautamR-Samagra, I wish to work on this project as part of the C4GT program. I am a pre-final-year student at IIIT Delhi, India, and I believe I will be able to contribute positively to the project. Since I only recently got to know about this program and the deadline is approaching, could you please give me clarity on what steps I should take to demonstrate my dedication and make my proposal strong?

Furthermore, it'd be great if I could get your Discord so that I can work directly under your supervision.

Awaiting your reply.

thank you

@GautamR-Samagra
Collaborator Author

  1. Create an alignment model that takes audio-transcript combinations of any length as input and outputs a list of audio-transcript combinations of any set audio length / word length.
    Looking at other forced-alignment tools here (a rough alignment sketch follows this list):
  • Wav2Vec2 --- (1)
  • aeneas *
  2. Collate the IndicSUPERB and NL-app datasets (create transcripts using wav2vec + any other tool) --- (2)

  3. Finalise dataset requirements for the audio-acoustic model: word audio + word pairs -- (3)

  4. Convert 2 into word pairs using 1 (or any format required by the acoustic embedding model). -- (4)

  5. Check out tiny denoisers (ideally less than 20 MB) like https://huggingface.co/qualcomm/Facebook-Denoiser/tree/main

  6. Train on a mixture of the IndicSUPERB and NL datasets with acoustic embeddings based on this -- (5)

  7. Create a test split from both the steno results (measuring ORF) and the dataset created above. -- (6)

  8. Iterate on improving accuracy.

  9. Acoustic model implementations:

  • train on Metaphone phonetic conversions instead of the transcripts directly, as shown here
  10. Word detection for audios: solving for the student pausing while speaking a word.

  11. Model experiments:

  • fine-tune and quantize Whisper and measure ORF (oral reading frequency) by setting a cutoff on token probabilities; be able to get token probabilities for stream-like Whisper and carry out the above
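
A rough sketch of the alignment step labelled (1), using torchaudio's CTC `forced_align` helper (available in recent torchaudio releases). The English Wav2Vec2 checkpoint is a stand-in; a Hindi model such as IndicWav2Vec would be swapped in for real use:

```python
# CTC forced alignment with a Wav2Vec2 ASR checkpoint: produces per-frame
# token assignments, from which word-level audio segments can be cut out.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr_model = bundle.get_model()
labels = bundle.get_labels()                 # blank '-' is index 0, '|' = space
dictionary = {c: i for i, c in enumerate(labels)}

def align(wav_path: str, transcript: str):
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        emissions, _ = asr_model(waveform)
        log_probs = torch.log_softmax(emissions, dim=-1)
    tokens = torch.tensor(
        [[dictionary[c] for c in transcript.upper().replace(" ", "|")]]
    )
    # alignment[i] = token id emitted at frame i; scores are frame log-probs
    alignment, scores = torchaudio.functional.forced_align(
        log_probs, tokens, blank=0
    )
    return alignment, scores
```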

@GautamR-Samagra
Collaborator Author

@xorsuyash can you comment here so that I can assign this to you?

@xorsuyash
Collaborator

xorsuyash commented Jun 4, 2024

@GautamR-Samagra

@xorsuyash
Collaborator

xorsuyash commented Jun 11, 2024

cc @GautamR-Samagra

Training an acoustic word embedding model to optimize audio-transcript matching

  • Dataset Preparation

    For training the acoustic word embedding model we need words and their corresponding audio pronunciations. For this we can leverage word-by-word forced alignment of the large amount of publicly available ASR data that contains speech with its transcription.
    We use the Viterbi algorithm with backtracking, which finds the most probable path of characters through the audio frames.

    Being able to segment audio word by word here
    Integrated as a service in autotune here

    Using IndicWav2Vec2 and open-mms as phoneme models for generating logits for the audios.

  • Model Training

    For model training we need audio-transcript pairs, which will be the output of our forced-alignment pipeline.
    The format of the audio-transcript data we are going to use for model training is here

    The initial approach is to use a Bi-LSTM layer to map audio and transcripts into a shared latent vector space, trained with objective losses that pull acoustically similar words close together, and pull each audio clip close to its own transcript, so that the embeddings can then be used to match audio to its correct orthographic segment. (A skeletal training sketch follows the references below.)

Model architecture is here

  • TODO

    • Generating high quality word transcript pair.
    • Figuring out ways to sample hard negatives for objective losses.
    • Designing tasks for target metrics like Average Precision for the model to optimize on the dev and test sets.
  • References

Multilingual jointly trained acoustic word embedding model

DMP proposal
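
A skeletal sketch of the Bi-LSTM encoder and triplet-style objective described above; all dimensions, the margin, and the last-step pooling are assumptions for illustration, not the final architecture:

```python
# Bi-LSTM acoustic word encoder: pools variable-length MFCC sequences
# into a fixed, L2-normalised vector. A triplet loss pulls two utterances
# of the same word together and pushes a different word away.
import torch
import torch.nn as nn

class AcousticWordEncoder(nn.Module):
    def __init__(self, n_mfcc: int = 39, hidden: int = 256, dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_mfcc) -> (batch, dim)
        out, _ = self.lstm(x)
        emb = self.proj(out[:, -1, :])  # last time step as sequence summary
        return nn.functional.normalize(emb, dim=-1)

encoder = AcousticWordEncoder()
triplet = nn.TripletMarginLoss(margin=0.4)
# anchor/positive: two utterances of one word; negative: a different word
a, p, n = (torch.randn(8, 120, 39) for _ in range(3))
loss = triplet(encoder(a), encoder(p), encoder(n))
loss.backward()
```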

@GautamR-Samagra
Collaborator Author

@prabakaranc98 here

@GautamR-Samagra
Collaborator Author

GautamR-Samagra commented Aug 5, 2024

Weekly Goals

Week 1

  • Force Alignment using Wav2Vec

Week 2

  • Force Alignment using Wav2Vec - Adding to Autotune

Week 3

  • Force Alignment using WhisperX

Week 4

  • Creating a dataset for the audio phonetic model

Week 5

  • Implementation of training loop based on the paper

Week 6

  • First model iteration of the audio phonetic model

Week 7

  • Fine tuning parameters and performance improvement

Week 8

  • Running inference on streaming/long audio data
  • Fine tuning parameters and performance improvement

Week 9

  • Combining audio phonetic model with inference
  • Fine tuning parameters and performance improvement

Week 10

  • Comparing against current model for ORF setup
  • Fine tuning parameters and performance improvement

Week 11

  • Improving setup and documentation

Week 12

  • Improving setup and documentation

@xorsuyash
Collaborator

xorsuyash commented Aug 9, 2024

Weekly Goals

Week 1

Week 2

Week 3

Week 4

  • Creating a dataset for the audio phonetic model

Week 5

Week 6

  • First model iteration of the audio phonetic model
    Discussion

Week 7

  • Fine tuning parameters and performance improvement

Week 8

  • Running inference on streaming/long audio data
  • Fine tuning parameters and performance improvement
    Inference

Week 9

  • Combining audio phonetic model with inference
  • Fine tuning parameters and performance improvement

Week 10

  • Comparing against current model for ORF setup
  • Fine tuning parameters and performance improvement

Week 11

  • Improving setup and documentation

Week 12

  • Improving setup and documentation
