SimpleOCR

This repo is a simple example of how to use Tesseract OCR to extract text from image documents. It provides image preprocessing (denoising, line removal, and rotation correction) and text postprocessing with some basic text-cleaning functions.

Installation

Tesseract and leptonica

Make sure that you have installed Tesseract on your device.

Linux Installation

apt-get install libleptonica-dev

apt-get install tesseract-ocr libtesseract-dev
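
To confirm that the Tesseract binary is reachable before installing the project, a check along the following lines may help. This is a minimal sketch, not part of the repo: the helper name `tesseract_version` is hypothetical, and it only assumes that the executable is called `tesseract` and is on your `PATH`.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if unavailable.

    Hypothetical helper for illustration; it is not part of SimpleOCR.
    """
    path = shutil.which(binary)  # look the executable up on PATH
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


version = tesseract_version()
print(version or "Tesseract not found on PATH")
```

If the script reports that Tesseract was not found, repeat the installation step for your platform before continuing.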

Windows Installation

Poetry

pip install poetry

For other installation options, see the Poetry documentation.

Project Installation

git clone https://github.com/ammarali32/SimpleOCR.git

cd SimpleOCR

poetry install

Testing

To run the package, use the command-line interface:

poetry run python run.py --input="./coding_test/samples/oldpaper.jpg" --output="./output.txt" --verbose

It also works in interactive mode.
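
The flags in the command above could be parsed with `argparse` roughly as follows. This is only a sketch of how such an interface might be wired up, not the actual contents of `run.py`: the function names are hypothetical, and the default value for `--output` is an assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the README command."""
    parser = argparse.ArgumentParser(
        description="Extract text from an image document."
    )
    parser.add_argument("--input", required=True, help="path to the input image")
    # The default below is an assumption made for this sketch.
    parser.add_argument(
        "--output", default="./output.txt", help="path of the text file to write"
    )
    parser.add_argument(
        "--verbose", action="store_true", help="print progress messages"
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


# Example: parse the same flags used in the README command.
args = parse_args(
    [
        "--input=./coding_test/samples/oldpaper.jpg",
        "--output=./output.txt",
        "--verbose",
    ]
)
print(args.input, args.output, args.verbose)
```

Using `action="store_true"` makes `--verbose` a plain switch, which matches how the flag is passed in the command above.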

Installation Tutorial

This is a full tutorial for installation and testing on Google Colab.

Documentation

File	Class	Method	Input	Output	Comments
run.py	-	run	command line	txt file	command-line interface
io_txt.py	-	read_file	file path	images as np.ndarray	input reader
io_txt.py	-	write_file	file path	txt file	output writer
denoise_photos_nn.py	denoisingModel	constructor	-	-	model trained on data from Kaggle
denoise_photos_nn.py	denoisingModel	forward	image as np.ndarray	image as np.ndarray	-
denoise_photos_nn.py	denoisingModel	load_weights	weights path	-	-
config.py	CFG	-	-	-	parameters such as the weights path
config.py	LOG	-	-	-	logger parameters and settings
preprocess.py	PreProcessor	constructor	CFG	-	-
preprocess.py	PreProcessor	fix_rotation	image as np.ndarray	image as np.ndarray	for images that are slightly rotated
preprocess.py	PreProcessor	denoiseAndBinarize	image as np.ndarray	image as np.ndarray	calls the denoising model
preprocess.py	PreProcessor	removeLines	image as np.ndarray	image as np.ndarray	for images that include lines
preprocess.py	PreProcessor	preprocess	image as np.ndarray	image as np.ndarray	calls all preprocessor functions
text_recognition.py	textRecognition	constructor	language str	-	default is English
text_recognition.py	textRecognition	get_text	image as np.ndarray	string text	uses psm 1 (automatic page segmentation with OSD)
postprocess.py	PostProcessor	constructor	CFG	-	-
postprocess.py	PostProcessor	removeEmptyLines	string text	string text	-
postprocess.py	PostProcessor	cleanText	string text	string text	removes undesirable characters (those not included in CFG.chars)
postprocess.py	PostProcessor	spellingCheck	string text	string text	not used by default; uncomment to enable
postprocess.py	PostProcessor	postprocess	string text	string text	calls all postprocessor functions
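
The postprocessing steps listed in the table can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code in `postprocess.py`: the function names are hypothetical snake_case stand-ins, `ALLOWED_CHARS` is a made-up whitelist standing in for `CFG.chars`, and the order of the two steps is a guess.

```python
import string

# Hypothetical whitelist standing in for CFG.chars; the real value lives in config.py.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \n")


def remove_empty_lines(text: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def clean_text(text: str, allowed=ALLOWED_CHARS) -> str:
    """Keep only the characters that appear in the whitelist."""
    return "".join(ch for ch in text if ch in allowed)


def postprocess(text: str) -> str:
    """Filter characters first, then remove lines that ended up empty."""
    return remove_empty_lines(clean_text(text))


# Simulated OCR output with stray symbols and blank lines.
raw = "Hello, w\u00a9orld!\n\n   \nSecond line\u2020\n"
print(postprocess(raw))
```

Filtering characters before removing empty lines means that a line consisting only of unwanted symbols is dropped as well.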

Visualization and Having Fun

References:

https://www.kaggle.com/c/denoising-dirty-documents
