Adding catalan language #8

ccoreilly · 2021-06-17T18:43:59Z

I would like to contribute by adding support for the Catalan language to gruut (and gruut-ipa / ipa2kaldi), but I am not sure about the g2p model.

I have a phonetisaurus g2p model which outputs CMU phonemes and the corresponding dictionary, would that suffice or should the model output IPA phonemes? I could maybe manually map the CMU phonemes to IPA and retrain the model.

I have also seen you have extracted g2p models from espeak-ng, how could I do so? Or have you converted a lexicon to its IPA phonetic representation with espeak and then trained a g2p model based on that?

synesthesiam · 2021-06-23T17:39:48Z

Hi @ccoreilly, thanks for offering to volunteer!

When adding a new language, my first step is to add the phonemes to gruut-ipa. These should be IPA, and I usually just use a Wikipedia page.

If you can manually map the CMU phonemes to IPA, that would be great. If you follow the convention here for English, it will be possible for gruut-ipa to convert between the CMU and IPA phonemes automatically.
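A minimal sketch of the manual mapping step, assuming the lexicon has one `word PH1 PH2 ...` entry per line. The table below is a hypothetical, partial example and not the convention used by gruut-ipa; the name `convert_line` and the stress-digit handling are also assumptions.

```python
# Hypothetical, partial CMU-style -> IPA table for illustration only.
# A real table must cover every phoneme the g2p model can emit.
CMU_TO_IPA = {
    "AA": "a",
    "AH": "ə",
    "B": "b",
    "K": "k",
    "S": "s",
}


def convert_line(line, mapping=CMU_TO_IPA):
    """Convert one 'word PH1 PH2 ...' lexicon line to IPA phonemes."""
    word, *phonemes = line.split()
    # Assumption: stress is marked with a trailing digit (AA1 -> AA).
    # An unmapped phoneme raises KeyError so gaps in the table are visible.
    ipa = [mapping[p.rstrip("012")] for p in phonemes]
    return f"{word} {' '.join(ipa)}"


print(convert_line("cas K AA1 S"))
```

Failing loudly on an unmapped phoneme is deliberate here: a silently skipped symbol would produce a lexicon that trains without errors but yields wrong pronunciations.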

> I have also seen you have extracted g2p models from espeak-ng, how could I do so?

I created a small script for this. I start by creating a list of words, usually just the words from my lexicon plus a list of frequent words in the language (I have one for Catalan). Make sure to lower-case and de-duplicate the words. Then I create the espeak-ng lexicon like this:

./espeak_word.sh < words.txt > lexicon.espeak.txt
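The word-list preparation mentioned above (lower-casing and de-duplicating) could be sketched as follows. The function name `prepare_words` is hypothetical, and the sketch assumes one word per line in the input.

```python
def prepare_words(lines):
    """Lower-case and de-duplicate words, keeping first-seen order."""
    seen = set()
    words = []
    for line in lines:
        word = line.strip().lower()
        # Skip blank lines and words already emitted.
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


print(prepare_words(["Casa\n", "casa\n", "\n", "Gat\n"]))
```

Writing the returned list one word per line would give a file usable as the `words.txt` input above.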

After that, converting it to a database is straightforward:

python3 -m gruut.lexicon2db --casing lower --lexicon lexicon.espeak.txt --database espeak/lexicon.db

I train separate g2p models for IPA and espeak-ng phonemes. See below for instructions on that, and let me know if you have any questions 🙂

G2P

Recent versions of gruut no longer use Phonetisaurus at runtime, which reduces the runtime dependencies. I'm hoping to add support for reading the g2p FSTs in pure Python, but for now I'm using a different framework.

Training still needs Phonetisaurus, however, for initial alignment of the corpus. If you're using my phonetisaurus Python package, you can get this when you train a model:

phonetisaurus train --corpus g2p.corpus --model g2p.fst lexicon.txt

The g2p.corpus file contains the alignments for all words in the lexicon. You use this to train a model in my new framework like this:

python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf

ccoreilly · 2021-06-29T22:03:58Z

Thanks for the thorough response Michael! I have been a bit busy lately but will make time to contribute.

mlrober · 2021-11-03T15:59:02Z

Hi Michael,

I'm trying to add a new language and created model.fst and model.corpus with phonetisaurus.
However, when I try to run the command below to get "model.crf":

python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf

I get this error:

zsh: killed python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf

That's it. Any ideas or troubleshooting steps to get rid of this, or any other way to get model.crf?

synesthesiam · 2021-11-03T17:28:45Z

How big is your pronunciation dictionary? Is it eating up all of your memory?

mlrober · 2021-11-05T04:55:53Z

Thanks for the reply. The corpus file is 23M in size.
Is it too big to train? What would be the ideal size?

ccoreilly · 2021-11-05T07:32:38Z

@mlrober are you working on Catalan or another language? (I haven't had the time, so it'd be great if your questions were specific to the Catalan language :)

mlrober · 2021-11-05T09:27:17Z

Hi Michael, I'm working on another language; however, I put my comment on the Catalan language issue. I reduced the file size and training finished. However, I found that the "loss:" parameter contains high numbers. I mean, I have around 1,78,000 words and it is howling all the nos in loss? Is that intended? I'd appreciate your response.
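One way to reduce the corpus size is random subsampling. This is only a sketch: the thread does not say how the file was reduced or what size is safe, and it assumes the aligned corpus holds one entry per line. The name `subsample_lines` and its parameters are hypothetical.

```python
import random


def subsample_lines(lines, keep, seed=0):
    """Return `keep` randomly chosen lines, preserving their original order."""
    lines = list(lines)
    if keep >= len(lines):
        return lines
    # A fixed seed makes the subset reproducible between runs.
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(lines)), keep))
    return [lines[i] for i in chosen]


sample = subsample_lines([f"entry{i}" for i in range(100)], keep=10)
print(len(sample))
```

Sampling at random, rather than keeping the first lines, avoids biasing the model toward one alphabetical range when the lexicon is sorted.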

synesthesiam · 2021-11-05T20:57:50Z

I guess we can consider this thread as "adding a new language" more generally 🙂

@mlrober, can you clarify what "howling all the nos in loss" means? Sorry, I can't quite interpret it 😕

mlrober · 2021-11-06T06:45:37Z

Hi Michael, sure. I was saying that after model training completed, I got results stating that the scores variable is empty and the loss variable contains the total number of words. I'm a bit confused: did the model train properly or not? Also, what steps do we need to follow to train a Glow-TTS model, and how many hours of data are required? Sorry if this is off topic. Kindly let me know. Thanks,

synesthesiam added the enhancement New feature or request label Jun 23, 2021

synesthesiam pushed a commit that referenced this issue Jul 3, 2024

Merge pull request #8 from fedecosta/feature/catalan_v3_1

64a23ab

Feature/catalan v3 1
