Added more docs
navaneeth committed Sep 20, 2020
1 parent cfe6a3d commit b7b65d9
Showing 1 changed file with 72 additions and 2 deletions.
74 changes: 72 additions & 2 deletions README.md
@@ -65,7 +65,7 @@ This will take some time, depending on how many words you are loading.

[Here are some more word corpus](http://mirror.rackdc.com/savannah/varnamproject/words/)

There is a `--import-learnings-from` option to import files which already has the learnt paramaeter. Importing these files don't take too much time as the word corpus.
There is a `--import-learnings-from` option to import files that already contain learnt parameters. Importing these files takes much less time than importing a word corpus.

What next?
==========
@@ -75,7 +75,77 @@ If you just want to use varnam for input, you have the following options
- [Varnam on iBUS](https://github.com/varnamproject/libvarnam-ibus) - For Linux
- [Varnam online editor](https://www.varnamproject.com/editor) - Platform agnostic

If you are a programmer, you will be interested in `libvarnam`. You can use it to provide indian language support in your applications. `libvarnam` can be used from different programming languages.
If you are a programmer, you will be interested in `libvarnam`. You can use it to provide Indian language support in your applications. `libvarnam` can be used from different programming languages.

How Varnam works
================

1. Scheme files and symbol tables
2. Transliteration
3. Learning

## Scheme files and symbol tables

A scheme file maps English letters to their phonetically equivalent Indic letters. All vowels, consonants and consonant clusters are mapped to their Indic equivalents. Varnam uses this mapping to perform transliteration.

Scheme files are plain text but use a custom DSL to make the mapping easier. The DSL is implemented in Ruby, so a scheme file can contain any valid Ruby code. It also provides many helper functions.

The `schemes/` directory contains the scheme files for all supported languages. Each language is represented by its ISO language code.

### Symbol tables

The compiled version of a scheme file is called a *Varnam Symbol Table* (vst). Compilation is done using the `varnamc` command line utility:

```
varnamc --compile schemes/ml
```

A symbol table is a binary representation of a plain text scheme file. It also contains other metadata items to make lookup easier.

libvarnam understands only the symbol table format. Because of this, every scheme file must be compiled into *vst* format before it can be used with varnam. The following command compiles all scheme files present in the *schemes* directory:

```
make vst
```

## Transliteration

`varnam_transliterate` is the entry point for transliteration:

```
varnam_transliterate(varnam *handle, const char *input, varray **output);
```

Transliteration converts *input* to the phonetically equivalent Indic text. It also provides a set of possible matches for the given input.

Transliteration performs the following steps under the hood:

1. It tokenizes the *input*. Varnam uses a greedy tokenizer which processes *input* from left to right. The tokenizer tries all possible combinations to generate the longest possible tokens for the given input. Tokens are generated using the symbol table provided to varnam.

2. It assembles the generated tokens and computes all possible combinations of them. For the input *malayalam*, varnam generates tokens like *മ, ല, യാ, ളം ([ma], [la], [ya], [lam])* and many others. Once these tokens are generated, they are combined and tested against the learning model to discard garbage values and find the most used words. Words are sorted by frequency and returned to the caller.
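
The greedy, left-to-right tokenization described above can be sketched in C as follows. This is only an illustration and not libvarnam's implementation: the `symbol` type, `TOY_TABLE`, `longest_match` and `toy_tokenize` are hypothetical names, and the tiny in-memory table stands in for a real symbol table. It also returns a single token per position, whereas varnam keeps multiple possibilities.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one symbol table entry. */
typedef struct {
    const char *pattern; /* Latin input pattern */
    const char *value;   /* Indic text it maps to */
} symbol;

/* Toy table for illustration only; a real vst holds far more entries. */
static const symbol TOY_TABLE[] = {
    {"a", "അ"},  {"m", "മ്"},  {"l", "ല്"},
    {"ma", "മ"}, {"la", "ല"}, {"ya", "യാ"}, {"lam", "ളം"},
};
static const size_t TOY_TABLE_LEN = sizeof TOY_TABLE / sizeof TOY_TABLE[0];

/* Index of the longest pattern that matches the start of input, or -1. */
static int longest_match(const char *input)
{
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < TOY_TABLE_LEN; i++) {
        size_t len = strlen(TOY_TABLE[i].pattern);
        if (len > best_len && strncmp(input, TOY_TABLE[i].pattern, len) == 0) {
            best = (int)i;
            best_len = len;
        }
    }
    return best;
}

/* Splits input into tokens, always taking the longest match first.
   Returns the token count, or -1 if some position has no match or
   the output array is too small. */
int toy_tokenize(const char *input, const symbol **tokens, size_t max_tokens)
{
    assert(input != NULL && tokens != NULL);
    size_t count = 0;
    while (*input != '\0') {
        int idx = longest_match(input);
        if (idx < 0 || count == max_tokens)
            return -1;
        tokens[count++] = &TOY_TABLE[idx];
        input += strlen(TOY_TABLE[idx].pattern);
    }
    return (int)count;
}
```

With this toy table, calling `toy_tokenize("malayalam", tokens, 16)` splits the input into the patterns `ma`, `la`, `ya` and `lam`. Note that `lam` wins over `la` at the last position because the longest match is preferred.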

## Learning

```
varnam_learn(varnam *handle, const char *word);
```

Varnam can learn new words. The more words it learns, the better it performs. The learning process learns words and their patterns.

The learning process persists the following data:

1. Patterns: All English combinations which can be used to input the given Indic text
2. Words: The Indic text itself
3. Prefixes: Prefixes of patterns and words

When an Indic word is learned, varnam tokenizes the word using the symbol table and tries to learn all possible patterns that can be used to input the word. Internally, varnam keeps a prefix tree and the frequencies of all patterns. This storage structure allows varnam to retrieve matching words efficiently when a pattern is presented. Basic stemming is also performed while learning words.
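
To make the idea of a prefix tree with frequencies concrete, here is a minimal sketch in C. It is not varnam's actual storage format: `trie_node`, `trie_new`, `trie_learn`, `trie_find` and `trie_free` are hypothetical names, and the sketch assumes patterns contain only the lowercase letters `a` to `z`.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 26 /* assumption: patterns use only 'a'..'z' */

typedef struct trie_node {
    struct trie_node *children[ALPHABET];
    const char *word;   /* Indic word for the pattern ending here, or NULL */
    unsigned frequency; /* how many times this pattern has been learned */
} trie_node;

/* Allocates an empty, zero-initialised node. */
trie_node *trie_new(void)
{
    return calloc(1, sizeof(trie_node));
}

/* Records one more sighting of pattern -> word.
   Returns the new frequency, or 0 on invalid input or allocation failure. */
unsigned trie_learn(trie_node *root, const char *pattern, const char *word)
{
    assert(root != NULL && pattern != NULL);
    trie_node *node = root;
    for (; *pattern != '\0'; pattern++) {
        if (*pattern < 'a' || *pattern > 'z')
            return 0;
        int slot = *pattern - 'a';
        if (node->children[slot] == NULL) {
            node->children[slot] = trie_new();
            if (node->children[slot] == NULL)
                return 0;
        }
        node = node->children[slot];
    }
    node->word = word; /* the caller keeps ownership of the string */
    return ++node->frequency;
}

/* Returns the node reached by pattern (which may be only a prefix of a
   learned pattern, in which case word is NULL), or NULL if no such path. */
const trie_node *trie_find(const trie_node *root, const char *pattern)
{
    const trie_node *node = root;
    for (; node != NULL && *pattern != '\0'; pattern++) {
        if (*pattern < 'a' || *pattern > 'z')
            return NULL;
        node = node->children[*pattern - 'a'];
    }
    return node;
}

/* Frees a node and everything below it. */
void trie_free(trie_node *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < ALPHABET; i++)
        trie_free(node->children[i]);
    free(node);
}
```

Because every pattern shares nodes with the patterns that have the same prefix, a lookup costs one step per input character regardless of how many words have been learned.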

When the same word/pattern combination is learned again, varnam computes the frequency with which it has seen this pattern. This frequency is used to sort the candidates and pick the best one while performing transliteration.
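
Sorting candidates by frequency could look like the following C sketch. The `candidate` type and the `rank_candidates` function are hypothetical and exist only for illustration; how varnam stores and ranks its candidates internally may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical candidate: a word and how often it has been learned. */
typedef struct {
    const char *word;
    unsigned frequency;
} candidate;

/* qsort comparator: higher frequency sorts first. */
static int by_frequency_desc(const void *a, const void *b)
{
    const candidate *x = a;
    const candidate *y = b;
    if (x->frequency != y->frequency)
        return x->frequency < y->frequency ? 1 : -1;
    return 0;
}

/* Orders candidates so that the most frequently seen word comes first. */
void rank_candidates(candidate *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1)
        qsort(items, count, sizeof *items, by_frequency_desc);
}
```

After ranking, the first element is the best candidate. Note that `qsort` is not guaranteed to be stable, so the relative order of candidates with equal frequency is unspecified.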

Learning can be initiated by calling Varnam APIs directly or using *varnamc*.

Input tools like ibus-engine will automatically learn the words that you are typing.


Mozilla Public License
======================