Varnam outputs invalid combinations with chil letters in Malayalam #166

Open
subins2000 opened this issue Dec 7, 2020 · 7 comments · May be fixed by #170
Comments

@subins2000
Member

Varnam outputs combinations with chil letters that are invalid in Malayalam. This looks like a bug in the Malayalam scheme file (or is it?), but I can't figure out how to fix it in the scheme. It happens with some chil letters in some instances: ൽി, ർി, ള്‍ി, ന്‍ി

Samples:

കിളിവാതിൽിൽ (kilivaathilil)
ഇല്ലെങ്കിൽെനിക്ക് (illenkilenikk)
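
The invalid sequences above share one shape: a chil letter (or anusvara) immediately followed by a dependent vowel sign. A minimal sketch of a checker for that shape follows. It is not part of Varnam; the function name `invalid_chil_combinations` is hypothetical, and the code point ranges are taken from the Unicode Malayalam block.

```python
import re

# Atomic chil letters U+0D7A..U+0D7F and anusvara U+0D02, or the older
# encoding of a chil as consonant + virama (U+0D4D) + ZWJ (U+200D).
CHIL_OR_ANUSVARA = "[\u0d7a-\u0d7f\u0d02]|[\u0d15-\u0d39]\u0d4d\u200d"
# Dependent vowel signs U+0D3E..U+0D4C.
VOWEL_SIGN = "[\u0d3e-\u0d4c]"
INVALID = re.compile(f"(?:{CHIL_OR_ANUSVARA}){VOWEL_SIGN}")


def invalid_chil_combinations(text):
    """Return every chil/anusvara + vowel-sign sequence found in text."""
    return INVALID.findall(text)
```

Running it on the two samples above should report the ൽി and ൽെ sequences, and it returns an empty list for text without such a combination.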

@subins2000
Member Author

This bug happens for anusvara (m => ം) as well:

$ varnamc -s ml -t undaavumo
ഉണ്ടാവുംോ
ഉന്ദാവുമൊ

@asdofindia
Contributor

Interesting. Without any learnings, this is the output:

$> ./varnamc -s ml -t kilivathilil
Token ki, 4
Token li, 4
Token va, 2
Token thi, 4
Token li, 4
Token l, 2
Transliterating kilivathilil
  കിലിവതിലിൽ
$> ./varnamc -s ml -t undavumo
Token u, 1
Token nda, 2
Token vu, 4
Token mo, 4
Transliterating undavumo
  ഉന്ദവുമൊ

So, this has something to do with learnings, perhaps?

@subins2000
Member Author

Yup, it has to do with the learning. A large set of files was used for learning, so tracking down where the error comes from is difficult. Still, Varnam should follow the language rules, and an if condition that checks whether it's a chil letter will fix it.

My opinion is that the tokenization should be improved. kilivathilil should also give other options with and വാ. It should also reduce the dependency on the learning file. The DB is huge, full of patterns and words! If it were just a word dictionary instead, that would be far more resource efficient.

@subins2000
Member Author

Found the root of the issue. Varnam's learnings contain the word kilivaathil => കിളിവാതിൽ. When Varnam finds this word, it uses the learned word and tokenizes the rest of the input, i.e.:

kilivaathilil
kilivaathil -> കിളിവാതിൽ
il -> ിൽ

which gives the result കിളിവാതിൽിൽ
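
The behaviour described above can be reproduced with a toy model. This is a minimal sketch, not libvarnam's actual code: `LEARNED`, `REST_TOKENS`, and `naive_transliterate` are hypothetical names, and the tables hold only the one word and one token mentioned in this comment.

```python
# Hypothetical learned-word table: pattern -> Malayalam word.
LEARNED = {
    # kilivaathil -> കിളിവാതിൽ (ends in the chil letter U+0D7D)
    "kilivaathil": "\u0d15\u0d3f\u0d33\u0d3f\u0d35\u0d3e\u0d24\u0d3f\u0d7d",
}
# Hypothetical token table for the remainder: "il" -> vowel sign i + chil l.
REST_TOKENS = {"il": "\u0d3f\u0d7d"}


def naive_transliterate(word):
    """Use the longest learned prefix, then append the tokenized remainder."""
    for end in range(len(word), 0, -1):
        prefix, rest = word[:end], word[end:]
        if prefix in LEARNED:
            # Plain concatenation: nothing checks that the learned word
            # ends in a chil while the remainder starts with a vowel sign.
            return LEARNED[prefix] + REST_TOKENS.get(rest, rest)
    return word  # no learned prefix; the toy model leaves the input as-is
```

Calling `naive_transliterate("kilivaathilil")` joins the two pieces without looking at the boundary, which is how a vowel sign ends up attached to a chil letter.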

Solution I'm thinking:

  • If the last letter is a chil or anusvaram (m), then
    • pop the last letter
    • Include it in the "rest of string" to tokenize
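
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the names `TOKENS`, `tokenize`, and `transliterate` are hypothetical, the token table is a tiny stand-in for the real scheme, and the sketch assumes the trailing chil corresponds to exactly one Latin letter at the end of the learned pattern.

```python
CHIL_OR_ANUSVARA = set("\u0d7a\u0d7b\u0d7c\u0d7d\u0d7e\u0d7f\u0d02")

# Hypothetical learned word: kilivaathil -> കിളിവാതിൽ
LEARNED = {
    "kilivaathil": "\u0d15\u0d3f\u0d33\u0d3f\u0d35\u0d3e\u0d24\u0d3f\u0d7d",
}
# Tiny stand-in token table: "li" -> ലി, "l" -> ൽ
TOKENS = {"li": "\u0d32\u0d3f", "l": "\u0d7d"}


def tokenize(text):
    """Greedy longest-match tokenizer over the toy table."""
    out, i = [], 0
    while i < len(text):
        for size in (2, 1):
            piece = text[i:i + size]
            if piece in TOKENS:
                out.append(TOKENS[piece])
                i += len(piece)
                break
        else:
            out.append(text[i])  # unknown input is passed through
            i += 1
    return "".join(out)


def transliterate(word):
    for end in range(len(word), 0, -1):
        prefix, rest = word[:end], word[end:]
        if prefix in LEARNED:
            learned = LEARNED[prefix]
            if learned[-1] in CHIL_OR_ANUSVARA:
                # Pop the chil and push its Latin letter back into the
                # remainder so the tokenizer sees "l" + "il" = "lil".
                learned, rest = learned[:-1], prefix[-1] + rest
            return learned + tokenize(rest)
    return tokenize(word)
```

With these tables, `transliterate("kilivaathilil")` re-tokenizes `lil` as ലി followed by ൽ, so the vowel sign lands on the full consonant instead of the chil.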

@asdofindia
Contributor

Oh. Nice catch. Seems like a good solution.

How are words that end in ് dealt with?

Like if പത്തരമാറ്റ് is there in word corpus and I type paththaramaattin does it do പത്തരമാറ്റ്ിന്‍ or പത്തരമാറ്റിന്‍?

subins2000 added a commit to subins2000/libvarnam that referenced this issue Mar 19, 2021
…letters in Malayalam

Varnam learnings has the word `kilivaathil => കിളിവാതിൽ`. When Varnam finds this word,
what it does is use the word plus tokenizes the rest of it. This gives chil combinations.
This PR adds a check for chil to replace the ending chil with its root consonant so that
proper grammatical combinations can happen.
@subins2000 subins2000 linked a pull request Mar 19, 2021 that will close this issue
@subins2000
Member Author

@asdofindia It does പത്തരമാറ്റിന്‍. There's no bug with virama. AFAIK, virama is inherent in varnam. The specific character mapped for it in varnam malayalam scheme is ~. But for transliterating words without the explicit ~, the virama is inherently appended at the end. If there is a vowel sound coming after, it's used instead of virama.

I changed the solution btw, it's now :

  • If the last letter is a chil or anusvaram (m), then
    • Replace the last chil letter with its root consonant. eg: ൽ with ല.

This is a better solution and less complex than the previous solution.
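
A minimal sketch of this replacement rule, not the code from the linked pull request. The names `CHIL_TO_ROOT` and `join_learned_and_rest` are hypothetical. Two assumptions are mine rather than the thread's: the swap is applied only when the remainder starts with a vowel sign, so a word that simply ends in a chil is left alone, and only atomic chil letters are handled (not the older consonant + virama + ZWJ encoding).

```python
# Chil letter (or anusvara) -> root consonant.
CHIL_TO_ROOT = {
    "\u0d7a": "\u0d23",  # ൺ -> ണ
    "\u0d7b": "\u0d28",  # ൻ -> ന
    "\u0d7c": "\u0d30",  # ർ -> ര
    "\u0d7d": "\u0d32",  # ൽ -> ല
    "\u0d7e": "\u0d33",  # ൾ -> ള
    "\u0d7f": "\u0d15",  # ൿ -> ക
    "\u0d02": "\u0d2e",  # ം -> മ
}
# Dependent vowel signs U+0D3E..U+0D4C.
VOWEL_SIGNS = {chr(cp) for cp in range(0x0D3E, 0x0D4D)}


def join_learned_and_rest(learned, rest):
    """Join a learned word with the tokenized remainder of the input."""
    if learned and rest and learned[-1] in CHIL_TO_ROOT and rest[0] in VOWEL_SIGNS:
        # Swap the trailing chil for its root consonant so the vowel
        # sign has a full consonant to attach to.
        learned = learned[:-1] + CHIL_TO_ROOT[learned[-1]]
    return learned + rest
```

For the example in this thread, joining കിളിവാതിൽ with ിൽ swaps the trailing ൽ for ല before appending, and the same rule turns a trailing ം into മ when a vowel sign follows.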

@subins2000
Member Author

subins2000 commented Aug 24, 2021

This bug has been fixed in GoVarnam. GoVarnam also changed the Malayalam scheme to have explicit patterns for chil (n_, l_). This also fixes the bug with an anusvara in the middle of a word, which now has its own pattern (m_).

sam_bhavam => സംഭവം
kal_vilakk -> കൽവിളക്ക്

https://gitlab.com/subins2000/govarnam/-/issues/2
