Adding catalan language #8
Comments
Hi @ccoreilly, thanks for offering to volunteer! When adding a new language, my first step is to add the phonemes to gruut-ipa. These should be IPA, and I usually just use a Wikipedia page. If you can manually map the CMU phonemes to IPA, that would be great. If you follow the convention here for English, it will be possible for gruut-ipa to convert between the CMU and IPA phonemes automatically.
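To illustrate the manual CMU-to-IPA mapping mentioned above, here is a minimal sketch. The table holds a few standard English ARPAbet symbols only as an example; a Catalan lexicon would need its own table, and the names `CMU_TO_IPA`, `cmu_to_ipa`, and `convert_lexicon_line` are hypothetical and not part of gruut-ipa. It also assumes lexicon lines of the form "word PH1 PH2 ...".

```python
# Illustrative subset of English ARPAbet (CMU) -> IPA pairs.
# Assumption: a Catalan table would be built the same way with its own symbols.
# This is NOT the file format gruut-ipa uses; it only shows the idea.
CMU_TO_IPA = {
    "AA": "ɑ",
    "IY": "i",
    "UW": "u",
    "EH": "ɛ",
    "CH": "tʃ",
    "SH": "ʃ",
    "NG": "ŋ",
    "K": "k",
    "T": "t",
    "S": "s",
}


def cmu_to_ipa(phonemes):
    """Convert a list of CMU phonemes to IPA, dropping stress digits (0-2)."""
    ipa = []
    for phoneme in phonemes:
        base = phoneme.rstrip("012")  # e.g. "IY1" -> "IY"
        if base not in CMU_TO_IPA:
            # Fail loudly so unmapped phonemes are noticed early.
            raise KeyError(f"No IPA mapping for CMU phoneme {phoneme!r}")
        ipa.append(CMU_TO_IPA[base])
    return ipa


def convert_lexicon_line(line):
    """Convert one 'word PH1 PH2 ...' line to 'word ipa1 ipa2 ...'."""
    word, *phonemes = line.split()
    return " ".join([word, *cmu_to_ipa(phonemes)])
```

Raising on an unmapped phoneme, rather than passing it through, makes gaps in the table visible before a g2p model is retrained on the converted lexicon.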
I created a small script for this. I start by creating a list of words, usually just the words from my lexicon plus a list of frequent words in the language (I have one for Catalan). Make sure to lower-case and de-duplicate the words. Then I create the espeak-ng lexicon like this:
./espeak_word.sh < words.txt > lexicon.espeak.txt
After that, converting it to a database is straightforward:
python3 -m gruut.lexicon2db --casing lower --lexicon lexicon.espeak.txt --database espeak/lexicon.db
I train separate g2p models for IPA and espeak-ng phonemes. See below for instructions on that, and let me know if you have any questions 🙂
G2P
Recent versions of gruut no longer use Phonetisaurus at runtime, to reduce the runtime dependencies. I'm hoping to add support for reading the g2p FSTs in pure Python, but for now I'm using a different framework. Training still needs Phonetisaurus, however, for the initial alignment of the corpus. If you're using my phonetisaurus Python package, you can get this when you train a model:
phonetisaurus train --corpus g2p.corpus --model g2p.fst lexicon.txt
The g2p.corpus file contains the alignments for all words in the lexicon. You use this to train a model in my new framework like this:
python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf |
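The "lower-case and de-duplicate" step for the word list could look roughly like the sketch below. It is only one way to do it; the function name `normalize_words` is hypothetical, and the input is assumed to have one word per line.

```python
def normalize_words(lines):
    """Lower-case, strip, and de-duplicate words, keeping first-seen order."""
    seen = set()
    words = []
    for line in lines:
        word = line.strip().lower()
        if word and word not in seen:  # skip blank lines and repeats
            seen.add(word)
            words.append(word)
    return words


# Small in-memory demonstration; real input would come from a word list file.
print(normalize_words(["Casa", "casa", "GAT", " gat ", ""]))
```

Applied to the lines of a raw word list, the result can be written out as words.txt and fed to the espeak-ng step. Keeping first-seen order is a choice made here for readability of the output, not something the thread requires.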
Thanks for the thorough response Michael! I have been a bit busy lately but will make time to contribute. |
Hi Michael, I'm trying to add a new language and have created model.fst and model.corpus with Phonetisaurus. When I run
python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf
I get this error:
zsh: killed python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf
That's all. Do you have any ideas or troubleshooting steps to fix this, or is there another way to get model.crf? |
How big is your pronunciation dictionary? Is it eating up all of your memory? |
Thanks for the reply. The corpus file is 23M in size. |
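If memory does turn out to be the cause of the "killed" message, one workaround is to train on a smaller corpus. The sketch below randomly keeps a fixed number of lines. It assumes the alignment corpus is line-oriented (one aligned word per line), which should be checked against the actual file; the function names and the output file name in the usage note are hypothetical.

```python
import random


def sample_lines(lines, keep, seed=0):
    """Return `keep` randomly chosen lines, preserving their original order."""
    lines = list(lines)
    if keep >= len(lines):
        return lines  # nothing to drop
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = sorted(rng.sample(range(len(lines)), keep))
    return [lines[i] for i in chosen]


def subsample_file(src, dst, keep, seed=0):
    """Write a random subset of `keep` lines from `src` to `dst`."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(sample_lines(lines, keep, seed))
```

For example, `subsample_file("g2p.corpus", "g2p.small.corpus", 50000)` would write a 50,000-line subset that can then be passed to the training command via --corpus. The right size depends on available memory and is a matter of trial and error.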
@mlrober are you working on Catalan or another language? (I haven't had the time so it'd be great if your questions were specific to the catalan language :) |
Hi Michael,
I'm working on another language; however, I posted my comment on the Catalan
language issue.
I reduced the file size and the training completed.
However, i found the "loss:" parameter contains higher no i mean i have
around 1,78,000 words and is howling all the nos in loss?
Is it intended?
I'd appreciate your response.
|
I guess we can consider this thread as "adding a new language" more generally 🙂 @mlrober, can you clarify what "howling all the nos in loss" means? Sorry, I can't quite interpret it 😕 |
Hi Michael,
Sure. I was saying that after model training completed, I got some
results stating that the scores variable is empty and the loss variable has
the number of all the words.
I'm a bit confused here: is the model trained properly or not?
Also, what steps do we need to follow to train a Glow-TTS model, and how
many hours of data are required?
Sorry if this is off topic.
Kindly let me know.
Thanks,
|
I would like to contribute by adding support for the Catalan language to gruut (and gruut-ipa / ipa2kaldi), but I am not sure about the g2p model.
I have a phonetisaurus g2p model which outputs CMU phonemes and the corresponding dictionary, would that suffice or should the model output IPA phonemes? I could maybe manually map the CMU phonemes to IPA and retrain the model.
I have also seen you have extracted g2p models from espeak-ng, how could I do so? Or have you converted a lexicon to its IPA phonetic representation with espeak and then trained a g2p model based on that?