Alternate Approach #171

subins2000 · 2021-03-27T16:22:48Z

Varnam has a tokenizer that converts Malayalam (or any other Indian language) text to Manglish patterns. While learning, Varnam builds a database that maps each such pattern to a word:

Pattern | Word ID | Learned
--- | --- | ---
mal | 77156 | 0
mala | 228 | 1
mala | 1586 | 1
mala | 5434 | 1
mala | 50134 | 1
mala | 57521 | 0
malaa | 50134 | 1
malaa | 57521 | 0
malaagha | 7784 | 1
malaaghama | 82823 | 0
malaaghamaa | 82823 | 0
malaaghamaar | 25013 | 1
malaaghamar | 25013 | 1
malaak | 102229 | 1
malaaka | 24048 | 1
malaaka | 43013 | 1

This makes the database very large. Varnam makes Malayalam suggestions by looking up the pattern in this database (the learnings database). If it can't find a match, it falls back to the tokenizer to build the word.
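To make the current flow concrete, here is a minimal sketch of a pattern lookup with a tokenizer fallback. The table name `patterns`, its column names, and the `tokenize` callback are assumptions made for illustration only; they are not taken from Varnam's actual schema or API.

```python
import sqlite3

# A few rows copied from the table above: (pattern, word_id, learned).
ROWS = [
    ("mal", 77156, 0),
    ("mala", 228, 1),
    ("mala", 1586, 1),
    ("malaa", 50134, 1),
    ("malaagha", 7784, 1),
]


def build_db(rows):
    """Create an in-memory stand-in for the learnings database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patterns (pattern TEXT, word_id INTEGER, learned INTEGER)"
    )
    conn.executemany("INSERT INTO patterns VALUES (?, ?, ?)", rows)
    return conn


def lookup(conn, pattern):
    """Return the word IDs stored for an exact pattern match."""
    cur = conn.execute(
        "SELECT word_id FROM patterns WHERE pattern = ? ORDER BY word_id",
        (pattern,),
    )
    return [row[0] for row in cur]


def suggest(conn, pattern, tokenize):
    """Prefer learned patterns; fall back to the tokenizer when nothing matches."""
    word_ids = lookup(conn, pattern)
    if word_ids:
        return ("learned", word_ids)
    # `tokenize` is a hypothetical callback standing in for the real tokenizer.
    return ("tokenized", tokenize(pattern))
```

Every pattern variant of a word needs its own row in this layout, which is where the size comes from.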

I want to know why this approach wasn't chosen @navaneeth :

No pattern => word DB (learnings file) is needed. Instead, we just need a word dictionary.
Add more patterns to the VST (Varnam Symbol Table), with prioritized letters: n => ന, ണ. A capitalized N will always give ണ. So pani will give suggestions in priority order: പനി, പണി. Currently, Varnam gives both outputs only if the learnings DB has pani assigned to both words.
When an input such as pani is given to Varnam, it should tokenize it to പനി and പണി using just the VST, then look up the word dictionary for words starting with പനി and പണി and offer them as additional suggestions.
For English words like "Cricket", tokenization will give bad results; in such cases we could perhaps use a pattern => word DB like the current learnings DB.
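The steps above can be sketched roughly as follows. This is only an illustration of the idea, not how Varnam works: `TOY_VST` is a tiny hand-written stand-in for the real symbol table, the greedy longest-match tokenizer is an assumption, and `WORDS` is a made-up three-word dictionary.

```python
from bisect import bisect_left
from itertools import product

# Hypothetical mini symbol table: each Manglish chunk maps to its Malayalam
# alternatives in priority order. Capital N only ever yields ണ.
TOY_VST = {
    "pa": ["പ"],
    "ni": ["നി", "ണി"],
    "Ni": ["ണി"],
}

# Hypothetical word dictionary, kept sorted so prefix lookups can use bisect.
WORDS = sorted(["പനി", "പണി", "പണിക്കാരൻ"])


def tokenize(pattern, table=TOY_VST):
    """Greedy longest-match split; returns one list of alternatives per token."""
    tokens = []
    longest = max(len(key) for key in table)
    i = 0
    while i < len(pattern):
        for size in range(min(longest, len(pattern) - i), 0, -1):
            chunk = pattern[i:i + size]
            if chunk in table:
                tokens.append(table[chunk])
                i += size
                break
        else:
            raise ValueError(f"no symbol for {pattern[i:]!r}")
    return tokens


def candidates(pattern):
    """All combinations of the alternatives, highest priority first."""
    return ["".join(parts) for parts in product(*tokenize(pattern))]


def words_with_prefix(sorted_words, prefix):
    """Dictionary words starting with prefix, found by binary search."""
    matches = []
    for word in sorted_words[bisect_left(sorted_words, prefix):]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


def suggest(pattern, sorted_words=WORDS):
    """Tokenized candidates first, then dictionary words that extend them."""
    suggestions = []
    for candidate in candidates(pattern):
        for word in [candidate] + words_with_prefix(sorted_words, candidate):
            if word not in suggestions:
                suggestions.append(word)
    return suggestions
```

With this toy data, `suggest("pani")` yields പനി before പണി, followed by the longer dictionary word that starts with പണി, while `suggest("paNi")` never offers പനി.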

By doing so, the size of the learnings database can be reduced a lot.


navaneeth · 2021-03-30T06:56:41Z

We can experiment with this approach. I don't remember why it wasn't chosen in the first place. Maybe it was because tokenizing pani will emit more tokens, and identifying the right pattern out of all the combinations could be slow. It would be difficult to do all this and still provide real-time suggestions that keep up with typing speed.

However, we should really implement a prototype and evaluate how it performs.
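A rough way to see the concern about combinations: if every token has several possible outputs, the number of candidate words is the product of those counts. The helper below is a hypothetical illustration of that arithmetic, not a measurement of Varnam.

```python
from math import prod


def combination_count(alternatives_per_token):
    """Number of candidate words when each token has the given number of alternatives."""
    return prod(alternatives_per_token)
```

For example, a pattern whose two tokens have 1 and 2 alternatives gives 2 candidates, but six tokens with 2 alternatives each already give 64, and each candidate would still need a dictionary lookup.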

subins2000 · 2021-04-04T18:52:09Z

I tried to see whether multiple suggestions for a pattern can be generated from the VST, but it's complex and I can't figure out how to do it. There is a VARNAM_MATCH_ALL flag for Malayalam -> Manglish patterns (which is used in learning), but not for Manglish -> Malayalam combinations. So this is a dead end for me.
