This repo is a simple example of how to use Tesseract OCR to extract text from image documents.
It provides preprocessing functions for images: denoising, line removal, and rotation correction.
Postprocessing of the recognized text consists of some basic text-cleaning functions.
Make sure that you have installed Tesseract on your device.

    apt-get install libleptonica-dev
    apt-get install tesseract-ocr libtesseract-dev
    pip install poetry

For other installation options see

    git clone https://github.com/ammarali32/SimpleOCR.git
    cd SimpleOCR
    poetry install

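Because the package depends on a system-wide Tesseract install, it can help to confirm that the binary is visible before running anything. The snippet below is a minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper name and is not part of this repo.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    # shutil.which performs the same lookup the shell does when resolving a command.
    return shutil.which(binary_name)


path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH; install it before running the package.")
else:
    print(f"Tesseract found at: {path}")
```

If the check reports that nothing was found, repeating the `apt-get` step is the likely fix.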
To run the package, use the command-line interface:

    poetry run python run.py --input="./coding_test/samples/oldpaper.jpg" --output="./output.txt" --verbose

It also works in interactive mode.
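For readers curious how the three flags in the command above could be handled, here is a rough `argparse` sketch. It is not the actual contents of `run.py`: the function names, the help strings, and the default value for `--output` are assumptions made for illustration.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the --input, --output and --verbose flags."""
    parser = argparse.ArgumentParser(description="Extract text from an image document.")
    parser.add_argument("--input", required=True, help="Path to the input image.")
    # Assumption: the default output path is illustrative, not taken from the repo.
    parser.add_argument("--output", default="./output.txt", help="Text file to write.")
    parser.add_argument("--verbose", action="store_true", help="Print progress messages.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


# Parse the same arguments as the example command and show the result.
example = parse_args([
    "--input=./coding_test/samples/oldpaper.jpg",
    "--output=./output.txt",
    "--verbose",
])
print(example.input, example.output, example.verbose)
```

Passing an explicit argument list, as done here, also makes the parser easy to exercise without touching the real command line.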
A full tutorial covering installation and testing on Colab is also available.
File | Class | Method | Input | Output | Comments |
---|---|---|---|---|---|
run.py | - | run | command line | txt file | command-line interface |
io_txt.py | - | read_file | file-path | Images as np.ndarray | input reader |
io_txt.py | - | write_file | file-path | txt file | output writer |
denoise_photos_nn.py | denoisingModel | constructor | - | - | Model trained on data from Kaggle |
denoise_photos_nn.py | denoisingModel | forward | image as np.ndarray | image as np.ndarray | - |
denoise_photos_nn.py | denoisingModel | load_weights | weights-path | - | - |
config.py | CFG | - | - | - | Some parameters like weights-path and others |
config.py | LOG | - | - | - | Logger parameters and setting |
preprocess.py | PreProcessor | constructor | CFG | - | - |
preprocess.py | PreProcessor | fix_rotation | image as np.ndarray | image as np.ndarray | corrects the image when it is slightly rotated |
preprocess.py | PreProcessor | denoiseAndBinarize | image as np.ndarray | image as np.ndarray | calls the denoising model |
preprocess.py | PreProcessor | removeLines | image as np.ndarray | image as np.ndarray | for images that include lines |
preprocess.py | PreProcessor | preprocess | image as np.ndarray | image as np.ndarray | calls all preprocessor functions |
text_recognition.py | textRecognition | constructor | language str | - | default is English |
text_recognition.py | textRecognition | get_text | image as np.ndarray | string text | uses psm 1 for automatic page segmentation with OSD |
postprocess.py | PostProcessor | constructor | CFG | - | - |
postprocess.py | PostProcessor | removeEmptyLines | string text | string text | - |
postprocess.py | PostProcessor | cleanText | string text | string text | removes undesirable characters (those not included in CFG.chars) |
postprocess.py | PostProcessor | spellingCheck | string text | string text | provided but not used; uncomment it to enable |
postprocess.py | PostProcessor | postprocess | string text | string text | calls all postprocessor functions |
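
The recognition and cleanup rows of the table can be illustrated with a short sketch. This is not the code in `text_recognition.py` or `postprocess.py`: the function names below are hypothetical snake_case stand-ins, `ALLOWED_CHARS` is an assumed placeholder for `CFG.chars`, and the order of the cleanup steps is a choice made for this example. The only external call is `pytesseract.image_to_string`, which assumes the `pytesseract` wrapper is installed.

```python
import string

# Assumption: stand-in for CFG.chars; the real whitelist lives in config.py.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \n")


def get_text(image, lang: str = "eng") -> str:
    """Run Tesseract with --psm 1 (automatic page segmentation with OSD)."""
    # Imported lazily so the cleanup helpers below work without pytesseract.
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang, config="--psm 1")


def clean_text(text: str, allowed=ALLOWED_CHARS) -> str:
    """Drop every character that is not in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)


def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def postprocess(text: str) -> str:
    """Clean characters first, so lines made only of junk become empty and are removed."""
    return remove_empty_lines(clean_text(text))


# Cleanup demo on a hand-written string; no OCR is involved here.
raw = "Hello\n\n  \nW\u00a7orld\u2022\n"
print(postprocess(raw))
```

With an image loaded as `np.ndarray`, the two stages would chain as `postprocess(get_text(image))`.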