Seeking Guidance on Custom Urdu ASR Training Data and Vocabulary Expansion #4900

Shaukataliii · 2024-01-03T18:36:04Z

Hello,
I am a developer building an Urdu automatic speech recognition (ASR) system with the Kaldi toolkit. I have run into two specific challenges and would greatly appreciate your insights.

Challenges

1. Acquiring Transcriptions for Custom Urdu Dataset:
- Issue: Obtaining accurate transcriptions for a substantial custom Urdu language dataset, tailored for industry-specific use, has proven challenging.
- Request: Seeking guidance or suggestions on cost-effective solutions or resources that could assist in obtaining accurate transcriptions.

2. Optimizing Kaldi ASR for Recognizing Unseen Words:
- Issue: We aim to optimize the Kaldi ASR model to efficiently recognize new words it may encounter during inference, especially industry-specific jargon.
- Request: Looking for insights or recommendations on approaches to handle previously unseen words and enhance the model's adaptability.

Thank you for your time and consideration.

judyfong · 2024-10-08T11:31:08Z

For the second challenge, I recommend taking a look at the Icelandic Althingi recipe (https://github.com/cadia-lvl/althingi-asr). We use sub-word modeling, and also FSTs and regular expressions through Thrax. We're looking to merge the recipe into kaldi-asr soon.
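
To illustrate the sub-word idea in general terms, here is a toy sketch. It is not code from the Althingi recipe, and the greedy longest-match rule, the `+` boundary marker, and the function names `segment` and `join` are all assumptions made for this example. The point is only that a word missing from the lexicon can still be written as a sequence of known units, and the recognizer's output can be glued back into words afterwards. In practice the unit inventory would usually be learned from the training text with a dedicated tool rather than written by hand.

```python
def segment(word, subwords, boundary="+"):
    """Greedy longest-match split of `word` into units from `subwords`.

    Falls back to single characters, so any word can be segmented.
    Every piece except the last gets a trailing boundary marker so
    that the original word can be reconstructed later.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return [p + boundary for p in pieces[:-1]] + pieces[-1:]


def join(pieces, boundary="+"):
    """Glue a flat sequence of sub-word pieces back into whole words."""
    words, current = [], ""
    for piece in pieces:
        if piece.endswith(boundary):
            current += piece[: -len(boundary)]
        else:
            words.append(current + piece)
            current = ""
    return words


# Hypothetical unit inventory, chosen by hand for the example.
units = {"trans", "form", "er", "s"}
pieces = segment("transformers", units)
print(pieces)
print(join(pieces))
```

With units like these, the lexicon and language model are built over the pieces instead of whole words, so a new jargon term only needs its pieces to be covered, not the term itself.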
