A Vietnamese word segmentation tool developed by underthesea, a Vietnamese Natural Language Processing research team. The repository provides an end-to-end working example of reading datasets, training machine learning models, and evaluating their performance, and it can easily be extended to train your own custom models.
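If you just need segmentation from Python, the same team's underthesea library wraps a pretrained model behind a one-line API. A minimal sketch, assuming underthesea has been installed with pip install underthesea:

# Word segmentation via the underthesea library (sketch, not part of this repo)
from underthesea import word_tokenize

sentence = "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"

# Returns a list of words; multi-syllable words come back as single items
print(word_tokenize(sentence))

# format="text" instead joins each word's syllables with underscores
print(word_tokenize(sentence, format="text"))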
Operating Systems: Linux (Ubuntu, CentOS), macOS
Python 3.6
Anaconda
languageflow==1.1.7
Clone the project using git
$ git clone https://github.com/undertheseanlp/word_tokenize.git
Create the environment and install the requirements (activate the environment before running pip, so the packages are installed into it rather than into the base environment)
$ cd word_tokenize
$ conda create -n word_tokenize python=3.6
$ source activate word_tokenize
$ pip install -r requirements.txt
Make sure you are in the word_tokenize folder and have activated the word_tokenize environment
$ cd word_tokenize
$ source activate word_tokenize
(On newer Anaconda releases, conda activate word_tokenize works as well.)
$ python word_tokenize.py --text "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"
$ python word_tokenize.py --fin tmp/input.txt --fout tmp/output.txt
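The file-based mode presumably expects one raw sentence per line in tmp/input.txt and writes the segmented sentences, one per line, to tmp/output.txt, with the syllables of each multi-syllable word joined by underscores. An illustrative (not verbatim) input/output pair:

tmp/input.txt:  Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò
tmp/output.txt: Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm_sò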
Train and test
First preprocess the VLSP 2013 corpus; the training command below expects the result at tmp/vlsp2013/train.txt
$ python util/preprocess_vlsp2013.py
$ python train.py \
--train tmp/vlsp2013/train.txt \
--model tmp/model.bin
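For orientation, Vietnamese word segmentation is commonly cast as sequence labelling over syllables with B/I-style tags and trained with a CRF. The sketch below shows the general shape of such a trainer using sklearn-crfsuite; it illustrates the technique only, not the repository's actual train.py, and the feature set is a deliberately minimal invention:

# Sketch of a CRF-based segmenter trainer, NOT the repo's train.py.
# Assumes pip install sklearn-crfsuite and training text in the
# underscore-joined format (e.g. "Chàng_trai 9X khởi_nghiệp ...").
import sklearn_crfsuite

def to_bi(line):
    # Split a segmented line into parallel (syllable, B/I-label) lists.
    syllables, labels = [], []
    for word in line.split():
        for i, syl in enumerate(word.split("_")):
            syllables.append(syl)
            labels.append("B" if i == 0 else "I")
    return syllables, labels

def features(syllables, i):
    # Minimal context features for syllable i.
    feats = {"bias": 1.0, "syl": syllables[i].lower()}
    if i > 0:
        feats["prev"] = syllables[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(syllables) - 1:
        feats["next"] = syllables[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

X, y = [], []
with open("tmp/vlsp2013/train.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        syls, labels = to_bi(line.strip())
        X.append([features(syls, i) for i in range(len(syls))])
        y.append(labels)

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)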
Predict with a trained model
$ python word_tokenize.py \
--fin tmp/input.txt --fout tmp/output.txt \
--model tmp/model.bin
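Continuing the sketch above, prediction reverses the label encoding: tag each syllable, then merge each B tag and its following I tags back into one underscore-joined word (again illustrative, reusing crf and features from the training sketch):

# Companion to the training sketch: decode B/I tags into segmented text.
def segment(crf, sentence):
    syls = sentence.split()
    tags = crf.predict_single([features(syls, i) for i in range(len(syls))])
    words = []
    for syl, tag in zip(syls, tags):
        if tag == "B" or not words:
            words.append(syl)
        else:
            words[-1] += "_" + syl
    return " ".join(words)

print(segment(crf, "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"))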
To be updated
Last update: May 2018