Indian Government Issued PAN Card Optical Character Recognition (OCR) Project

The purpose of this WIP project is to efficiently extract the text contained in a PAN Card image and store it as JSON. The libraries used are listed further below. The current version, 2.0, was last run successfully in October 2018. Recommendations are welcome. Keep the limitations of pytesseract in mind: it does not work reliably on noisy images (for example, salt-and-pepper grain) or on poor-quality images, i.e. anything below 300 DPI. More information can be found here

The algorithm has been tested on both good-quality and poor-quality images. The accuracy of the extracted information depends heavily on the resolution and overall quality of the image. While Tesseract performs well on near-perfect images with little or no noise, it fails in trickier situations, especially where light reflects off the surface of the PAN card or the card is twisted or turned. There are a couple of versions in this compendium repository:

Implemented on PyTesseract
Implemented using the Google API (Setup your Cloud Services)
Implemented using OCR.Space (More info here)

Note: When using the API versions of this program, make sure that the image you submit for detection is smaller than 1 MB. If the file size exceeds this threshold, the program will not return a result.
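A minimal sketch of a size check that could be run before calling either API version. The function name and the interpretation of "1 MB" as 1024 * 1024 bytes are assumptions made for illustration; they are not taken from the repository code.

```python
import os

# Assumed reading of the "1 MB" limit mentioned in the note above.
MAX_API_IMAGE_BYTES = 1 * 1024 * 1024


def is_within_api_limit(path, limit=MAX_API_IMAGE_BYTES):
    """Return True when the file at `path` is strictly smaller than `limit` bytes."""
    return os.path.getsize(path) < limit
```

An image that fails this check can be resized or recompressed before it is sent to the API.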

IDE and list of libraries used:

PyCharm Community Edition running Python 3.6
Pillow
pytesseract
cv2
re
json
ftfy
os
argparse
nostril

Usage

Each component in this repository has specific tasks, explained as follows:

crop_morphology.py Usage is described in the file itself. This script crops the image down to the area that contains textual information. For instance, given a scanned copy of a PAN card on a white background, it crops the image to the detected border of the PAN card. Command: python crop_morphology.py image_pan.jpg
deskew.py The intuition behind deskewing, and why it matters for image-to-text conversion, can be found at this link. Given an image containing a block of text rotated at an unknown angle, we need to correct the text skew by:
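The idea of cropping a scan to its content can be illustrated with a much simpler stand-in than the morphology-based approach in crop_morphology.py. The sketch below only takes the bounding box of all pixels darker than an assumed background level; the function name, the threshold of 240 and the margin are hypothetical choices for illustration.

```python
import numpy as np


def crop_to_content(gray, background_level=240, margin=5):
    """Crop a grayscale image to the bounding box of its non-background pixels.

    Pixels with a value below `background_level` count as content; brighter
    pixels are treated as the white scanner background (an assumption).
    """
    mask = gray < background_level
    if not mask.any():
        # Nothing darker than the background was found: return the input as-is.
        return gray
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin + 1, gray.shape[0])
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin + 1, gray.shape[1])
    return gray[top:bottom, left:right]
```

This works only when the background is clean and close to white; a noisy scan is where a morphology-based crop is more appropriate.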
1. Detecting the block of text in the image.
2. Computing the angle of the rotated text.
3. Rotating the image to correct for the skew.
Text skew correction is typically applied in automatic document analysis, but the process itself can be applied to other domains as well. Command: python deskew.py image_pan.jpg
morph_final.py An alternative to crop_morphology.py in case its results degrade. More info can be found here. Command: python morph_final.py image_pan.jpg
morph_interactive.py A playground for morphing images as needed by cycling through the various parameters found here. Command: python morph_interactive.py image_pan.jpg

Press 1 & 2 to cycle through the different modes and the CV window will show the sliders to adjust the intensity of the preprocessing steps involved.

Note: You will need to save the image yourself once you are satisfied with the result. Tesseract is not a one-stop shop for all OCR needs, especially for PAN cards, which differ from case to case.
json2csv.py Once you have converted all the files into their respective extracted JSONs, you can export them into a CSV for analysis and other usage.

Command: python json2csv.py jsons output.csv

Note: jsons is the folder name and should not be written as \jsons; the program treats the specified folder as being in the program's own directory. If output.csv is not written to disk, first create an empty flat file with the same name, after which the write should succeed.
ocr_v2.py Contrary to the name, this is the current functional program for extracting text from the image after all preprocessing steps.
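For reference, the JSON-to-CSV export can be sketched with the standard library alone. This is an illustration rather than the code in json2csv.py: it assumes every file in the folder holds one flat JSON object, and the function name is hypothetical.

```python
import csv
import json
import os


def jsons_to_csv(folder, csv_path):
    """Write one CSV row per .json file in `folder`; return the number of rows."""
    records = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".json"):
            with open(os.path.join(folder, name), encoding="utf-8") as fh:
                records.append(json.load(fh))

    # Use the union of all keys, in first-seen order, as the CSV header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
    return len(records)
```

Opening the output in "w" mode creates the file when it does not exist, so no empty placeholder file is needed with this variant.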
ocr_main.py Uses OCR Space API to extract text from image.
google_vision.py Uses Google Vision API to extract text from image.
preprocess_v2.py More information on this version of preprocessing can be found here; it is based on the paper Font and Background Color Independent Binarization. For best accuracy, run this file on the image before passing it through the Tesseract engine.

Command: python preprocess_v2.py input.jpg output.jpg
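Once an image has been preprocessed and passed through OCR, the raw text still has to be turned into structured JSON. The sketch below shows one way to do that for text that has already been extracted (for example with pytesseract.image_to_string). It relies on the standard PAN layout of five letters, four digits and one letter; the DD/MM/YYYY date pattern and the output field names are assumptions for illustration, not the schema used by ocr_v2.py.

```python
import json
import re

# A PAN is ten characters: five letters, four digits, one letter.
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
# Assumed date layout on the card: DD/MM/YYYY.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def parse_pan_text(text):
    """Pull the PAN and the first date out of raw OCR text."""
    pan = PAN_PATTERN.search(text)
    date = DATE_PATTERN.search(text)
    return {
        "pan": pan.group(0) if pan else None,
        "date_of_birth": date.group(0) if date else None,
    }


if __name__ == "__main__":
    sample = "Permanent Account Number\nABCDE1234F\n01/01/1990"
    print(json.dumps(parse_pan_text(sample), indent=2))
```

Fields that OCR could not recover are returned as None, which makes failed extractions easy to spot in later analysis.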

Preprocessing Commands

Usage: `python ocr_v2.py -i image_pan.jpg -p command`

Command	Context	Explanation
`thresh`	Linear Threshold	First, you pick a threshold value, say 127. With a binary threshold, a pixel whose value is greater than the threshold becomes white (the maximum value); otherwise it becomes black. OpenCV provides different types of thresholding methods that can be passed as the fourth parameter. I often use binary threshold for most tasks, but for other thresholding methods you may visit the official documentation.
`adaptive`	Adaptive Threshold	There are two adaptive methods for calculating the threshold value. Adaptive Thresh Mean uses the mean of the neighborhood area as the threshold, while Adaptive Thresh Gaussian uses a Gaussian-weighted sum of the neighborhood values. In OpenCV, a constant C is subtracted from the result in both cases.
`linear`	Image Resizing	Faster image resizing
`cubic`	Image Resizing	You may need to scale your image to a larger size to recognize small characters. In this case, INTER_CUBIC generally performs better than other alternatives, though it’s also slower than others.
`blur`	Median Blur	In median blurring, the central element in the kernel area is replaced with the median of all the pixels under the kernel. This outperforms other blurring methods at removing salt-and-pepper noise. Median blurring is a non-linear filter: unlike linear filters, it replaces each pixel with a value that already occurs in its neighborhood, so it tends to preserve edges.
`gauss`	Gaussian Blur	Gaussian blurring works in a similar fashion to averaging, but it uses a Gaussian kernel instead of a normalized box filter for the convolution. The dimensions of the kernel and the standard deviations in both directions can be set independently. Gaussian blurring is very useful for removing Gaussian noise from the image. However, it does not preserve the edges in the input.
`bilateral`	Bilateral Filtering	Speaking of keeping edges sharp, bilateral filtering is quite useful for removing noise without smoothing the edges. Like Gaussian blurring, it uses a Gaussian filter to compute a weighted average over the neighborhood. However, it also takes the difference in pixel intensity into account, so only pixels with an intensity similar to the central pixel contribute to the blur. As a result, regions with large intensity variation, i.e. edges, are preserved.

Accuracy Matrix

Way Forward

PyTesseract and the Tesseract engine have many flaws when it comes to converting images to text, especially if the image is noisy, for example with salt-and-pepper noise. This could be overcome later by implementing image classification algorithms using LSTMs for better accuracy.

This is a WIP Project because eventually I plan on expanding the program into a Flask application that would be able to extract information from PAN Card, Aadhar Card, Voter ID Card, Driving License and Indian Passport with a UI feature.
