A language-agnostic spellchecker and corrector based on Hidden Markov Models. Mistakes are detected using a probabilistic edit distance, and corrections are computed by running the Viterbi algorithm on a special kind of HMM. The probabilistic distance, the generalized HMM used in the corrector and the search space reduction are all novel ideas in the field.
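To illustrate the general idea of correcting a sentence with the Viterbi algorithm, here is a minimal sketch on a standard first-order HMM. It is not the project's generalized HMM: the vocabulary, the probabilities in START and TRANS, the emission score and the function names (edit_distance, candidates, emission, correct) are all hypothetical and chosen only for this toy example. The candidate filter is a plain edit-distance cutoff, used as a stand-in for the project's own search space reduction.

```python
import math

FLOOR = 1e-6  # probability given to anything this toy model has not seen

# Hypothetical toy model, NOT taken from the project.
VOCAB = ["the", "cat", "car", "sat", "eat"]
START = {"the": 0.9}
TRANS = {("the", "cat"): 0.6, ("the", "car"): 0.3,
         ("cat", "sat"): 0.7, ("car", "sat"): 0.05, ("cat", "eat"): 0.1}


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def candidates(typed, max_dist=1):
    """Hidden states considered for a typed word (simple cutoff, an assumption)."""
    found = [w for w in VOCAB if edit_distance(w, typed) <= max_dist]
    return found or [typed]  # unknown words are kept unchanged


def emission(intended, typed):
    """Toy, unnormalized score for typing `typed` when `intended` was meant."""
    return 0.1 ** edit_distance(intended, typed)


def correct(words):
    """Viterbi search for the most likely sequence of intended words."""
    if not words:
        return []
    # best[w] = (log-score of the best path ending in w, that path)
    best = {w: (math.log(START.get(w, FLOOR)) + math.log(emission(w, words[0])), [w])
            for w in candidates(words[0])}
    for typed in words[1:]:
        step = {}
        for w in candidates(typed):
            # Pick the best predecessor for w.
            score, path = max(
                ((s + math.log(TRANS.get((p, w), FLOOR)), pth)
                 for p, (s, pth) in best.items()),
                key=lambda item: item[0])
            step[w] = (score + math.log(emission(w, typed)), path + [w])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print(correct(["the", "caz", "sat"]))  # ['the', 'cat', 'sat']
```

In this example "caz" is one edit away from both "cat" and "car", so the emission score alone cannot decide. The transition probabilities resolve the tie in favour of "cat", which is the kind of contextual choice an HMM-based corrector relies on.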
The generalization of the underlying MM:
It provides a Web Interface, a CLI and a ready-to-import class with methods.
The Web Interface of the project:
Authors: Cristian Baldi - Simone Vitali
This work was developed as the final project of the course "Probabilistic Models" at University of Milano Bicocca.
- git clone this repo
- pip install -r requirements.txt
- python correct.py for command line usage
- python web_interface.py for a web-based interface
The project includes a model able to correct Italian sentences.
- Create a folder inside data/ and name it whatever you want
- Add some .txt files to it
- Change the current model in config.py
- Run python learn.py and python build_model.py
- You can now use it.
python performance_tests.py
- data/, data for building the model and running the correction algorithm
- web_interface, files for the web interface
- build.py, builds the required objects for running the model
- build_test_set.py, builds the test set for running tests
- config.py, configuration file
- correct.py, command line interface
- learn.py, analyzes the text and builds word distributions and transition probabilities
- probabilistic_distance.py, computes the probabilistic distance between two strings
- performance_test.py, runs tests
- Viterbi.py, implements the Viterbi algorithm and state selection functions
- web_interface.py, runs the web interface
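As a rough illustration of what a probability-weighted edit distance between two strings can look like, the sketch below assigns a probability to each edit operation and returns the negative log-probability of the cheapest alignment. This is only one possible reading of the idea and is not the definition implemented in probabilistic_distance.py. The operation probabilities, the CONFUSABLE pairs and the name weighted_distance are hypothetical.

```python
import math

# Hypothetical operation probabilities, NOT taken from the project.
P_MATCH = 0.95      # character typed correctly
P_SUB = 0.02        # generic substitution
P_SUB_NEAR = 0.04   # substitution between easily confused characters
P_INS = 0.015       # extra character typed
P_DEL = 0.015       # character left out

# Hypothetical pairs of characters that are easy to confuse.
CONFUSABLE = {("a", "s"), ("s", "a"), ("e", "r"), ("r", "e")}


def cost(p):
    """Turn a probability into an additive cost."""
    return -math.log(p)


def weighted_distance(intended, typed):
    """Negative log-probability of the best alignment of `typed` to `intended`."""
    n, m = len(intended), len(typed)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + cost(P_DEL)
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + cost(P_INS)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = intended[i - 1], typed[j - 1]
            if a == b:
                p = P_MATCH
            elif (a, b) in CONFUSABLE:
                p = P_SUB_NEAR
            else:
                p = P_SUB
            dist[i][j] = min(dist[i - 1][j] + cost(P_DEL),   # deletion
                             dist[i][j - 1] + cost(P_INS),   # insertion
                             dist[i - 1][j - 1] + cost(p))   # match or substitution
    return dist[n][m]


# A likely confusion ("a" typed as "s") costs less than an unlikely one.
print(weighted_distance("cat", "cst") < weighted_distance("cat", "cxt"))  # True
```

Working with negative logarithms keeps the dynamic program additive, and math.exp(-weighted_distance(a, b)) recovers the probability of the best alignment under these assumed values.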
For this project, a dataset of Italian tweets was used. It is available here: https:// datacloud [DOT] di.unito [DOT] it/index.php/s/Wn8tRFyETxZkqJc