A language processing tool for Sinhalese (සිංහල).
Update 2019.07.21: This tool no longer requires Java to run the Sinhala tokenizer. All Java code has been ported to Python for convenience.
Steps:
- Create a new folder named bin in the root path
- Download stat.split.pickle to the bin folder
- Import the required tools from the sinling module in your project (you may have to append this project's path to your path environment variable)
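The setup steps above can be sketched in Python as follows. This is a minimal illustration rather than part of sinling: the helper name prepare_sinling is hypothetical, the download itself is left to you, and a temporary folder stands in for the real project root.

```python
import sys
import tempfile
from pathlib import Path


def prepare_sinling(project_root):
    """Create the bin folder and make the project importable.

    Returns (bin_dir, model_present). prepare_sinling is a hypothetical
    helper written for this sketch; it is not provided by sinling.
    """
    root = Path(project_root).resolve()
    bin_dir = root / "bin"
    bin_dir.mkdir(parents=True, exist_ok=True)

    # Append the project path so that "import sinling" can find the package.
    if str(root) not in sys.path:
        sys.path.append(str(root))

    # The file is not downloaded here; we only report whether it is in place.
    model_present = (bin_dir / "stat.split.pickle").is_file()
    return bin_dir, model_present


# Demo on a temporary folder; replace demo_root with your own project root.
demo_root = tempfile.mkdtemp()
bin_dir, model_present = prepare_sinling(demo_root)
print(bin_dir, model_present)
```

If model_present is False, stat.split.pickle still has to be downloaded into the bin folder.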
from sinling import SinhalaTokenizer
tokenizer = SinhalaTokenizer()
sentence = '...' # your sentence
tokenizer.tokenize(sentence)
The Sinhala tokenizer also includes a tokenizer.split_sentences(...) function for splitting sentences.
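One way to combine the two calls is to split a text into sentences first and then tokenize each sentence. The sketch below assumes that split_sentences accepts a string and returns an iterable of sentence strings, and that tokenize returns a list of tokens; neither detail is confirmed here. WhitespaceTokenizer is a hypothetical stand-in so the example runs without sinling installed.

```python
def tokenize_document(tokenizer, text):
    """Split text into sentences, then tokenize each sentence.

    Works with any object that offers split_sentences(text) and
    tokenize(sentence); the argument and return types are assumptions.
    """
    return [tokenizer.tokenize(s) for s in tokenizer.split_sentences(text)]


class WhitespaceTokenizer:
    """Hypothetical stand-in for SinhalaTokenizer, used only for this demo."""

    def split_sentences(self, text):
        # Naive split on full stops, dropping empty pieces.
        return [s.strip() for s in text.split(".") if s.strip()]

    def tokenize(self, sentence):
        # Naive split on whitespace.
        return sentence.split()


tokens_per_sentence = tokenize_document(WhitespaceTokenizer(), "a b. c d e.")
print(tokens_per_sentence)
```

With sinling available, passing SinhalaTokenizer() in place of the stand-in should work as long as the assumptions above hold.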
from sinling import preprocess, word_joiner
w1 = preprocess('මුනි')
w2 = preprocess('උතුමා')
results = word_joiner.join(w1, w2)
# Returns a list of possible results after applying join rules ['මුනිතුමා', ...]
from sinling import word_splitter
word = '...'
results = word_splitter.split(word)
# Returns a dict containing debug information, base word and affix
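Because the exact keys of the returned dict are not listed here, a small helper that prints whatever keys are present can be useful when exploring the output. The helper name describe_split and the keys in the sample dict are hypothetical placeholders, not the actual structure returned by word_splitter.split.

```python
def describe_split(result):
    """Return one 'key: value' line per entry, whatever keys the dict has."""
    return [f"{key}: {value!r}" for key, value in result.items()]


# Stand-in result with hypothetical keys; in practice pass the dict
# returned by word_splitter.split(word).
sample = {"base": "<base word>", "affix": "<affix>", "debug": []}
lines = describe_split(sample)
for line in lines:
    print(line)
```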
Visit here to see some sample splits.
project
.
+-- README.md
+-- sinling
│ +-- ...
+-- scripts
│ +-- rules_exmple.py
│ +-- evaluate.py
+-- docs
+-- existing_work
│ +-- sinhala_alphabet.xls
│ +-- helabasa
│ +-- ...
- Contact [email protected] if you would like to contribute to this project.
This project is still a work in progress. Use at your own risk.