Optimize varnam_learn #141

Open

navaneeth opened this issue Feb 11, 2017 · 0 comments

navaneeth commented Feb 11, 2017

Today when a new word is learned, Varnam does the following:

Identifies all possible patterns for the word
Sometimes there are too many patterns, so it stops generating them after a limit
Stores all the patterns and word prefixes in the learnings file
Stores patterns and words in different schemas
When transliterating, looks up the patterns table and performs the transliteration
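
To make the current behaviour concrete, here is a rough sketch in Python (not the actual libvarnam C code). It assumes the patterns for a word are the Cartesian product of the possible romanizations of each token, cut off at a limit. The token table, `enumerate_patterns`, `learn` and the limit of 4 are all made up for illustration.

```python
from itertools import islice, product

# Hypothetical per-token romanizations; NOT Varnam's real scheme data.
TOKENS = [
    ("മ", ["ma"]),
    ("ല", ["la"]),
    ("യാ", ["ya", "yaa"]),
    ("ളം", ["lam", "Lam", "lham"]),
]


def enumerate_patterns(tokens, limit):
    """Return at most `limit` romanized patterns for a tokenized word."""
    alternatives = [alts for _, alts in tokens]
    combos = product(*alternatives)  # every combination of alternatives
    return ["".join(combo) for combo in islice(combos, limit)]


def learn(store, tokens, limit=4):
    """Persist pattern -> word for every pattern generated before the limit."""
    word = "".join(text for text, _ in tokens)
    for pattern in enumerate_patterns(tokens, limit):
        store[pattern] = word
    return word


store = {}
word = learn(store, TOKENS, limit=4)

# Number of patterns that exist without the limit (product of alternatives).
total = 1
for _, alts in TOKENS:
    total *= len(alts)

print(len(store), "of", total, "patterns stored for", word)
```

With this toy table, one word already needs several stored rows, and anything past the limit (for example `malayaaLam`) is never stored, so typing it later would not match the learned word.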

This is inefficient for the following reasons:

More storage is used because every generated pattern is persisted
Some patterns are skipped to restrict disk usage. The skipped patterns could be important ones
Learned data is not reusable across different schemes in the same language. For example, if someone uses both ml-phonetic and ml-inscript, the learned data has to be stored separately for each scheme

The following points have to be considered when attempting to solve this:

Performance of transliterate has to be really good. With this change, transliterate will have to do more work at lookup time: tokenizing the input and finding all possible paths. So there is a risk of introducing performance issues. Think about an in-memory data store, constant-time lookup, etc.
A new data structure has to be designed to persist the learned data. It has to be efficient in both space and computation
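
One possible direction, sketched in Python only to illustrate the idea (this is not a proposed final design, and none of these names exist in libvarnam): store just the learned words in an in-memory trie keyed on the target script, and let transliterate tokenize the input with the active scheme while pruning any path the trie does not contain. `Node`, `WordTrie`, `transliterate`, `PHONETIC` and `KEYS` are hypothetical, and both scheme tables are made up. `KEYS` is a key-per-character scheme standing in for something like ml-inscript.

```python
class Node:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # next character -> Node
        self.count = 0      # times a word ending here was learned


class WordTrie:
    """Scheme-independent store: holds only words in the target script."""

    def __init__(self):
        self.root = Node()

    def learn(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.count += 1

    @staticmethod
    def descend(node, text):
        """Follow `text` from `node`; return None if the path is absent."""
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node


def transliterate(trie, scheme, text):
    """Return learned words reachable from `text` under `scheme`."""
    found = {}
    max_key = max(len(key) for key in scheme)

    def walk(pos, node, out):
        if pos == len(text):
            if node.count:
                found["".join(out)] = node.count
            return
        for size in range(1, min(max_key, len(text) - pos) + 1):
            for target in scheme.get(text[pos:pos + size], ()):
                nxt = WordTrie.descend(node, target)
                if nxt is not None:  # prune paths no learned word follows
                    walk(pos + size, nxt, out + [target])

    walk(0, trie.root, [])
    return sorted(found, key=lambda w: -found[w])


# Made-up scheme tables for illustration only.
PHONETIC = {
    "ma": ["മ"], "la": ["ല", "ള"], "ya": ["യ", "യാ"],
    "yaa": ["യാ"], "lam": ["ലം", "ളം"],
}
KEYS = {
    "c": ["മ"], "n": ["ല"], "/": ["യ"], "N": ["ള"],
    "e": ["\u0d3e"],  # vowel sign aa
    "x": ["\u0d02"],  # anusvara
}

trie = WordTrie()
trie.learn("മലയാളം")  # learned once, usable from both schemes

print(transliterate(trie, PHONETIC, "malayalam"))
print(transliterate(trie, KEYS, "cn/eNx"))
```

In this sketch the word is stored once and both schemes resolve to it, so nothing is duplicated per scheme and no pattern is dropped by a limit. The cost moves to lookup time: it is not constant time, but the trie prunes most token combinations early. Whether that is fast enough for real inputs would need measuring.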


navaneeth added the new feature label

navaneeth changed the title ~~Optimized learn~~ Optimize varnam_learn
