This repo is directly inspired by the CRNN (CNN+LSTM) paper by Baoguang Shi et al., published in 2015. The neural network architecture introduced in that paper is arguably a foundation of modern OCR technology. Numerous text recognition repos have appeared on GitHub since then, most of them more or less adaptations of the CNN+LSTM architecture, including this one.
I have been using both conventional OCR (OpenText) and deep-learning-based OCR (Tesseract and AWS Textract) for quite a long time. I call the former "conventional" simply to distinguish it from modern deep-learning-based OCR. CNN-based OCR outperforms the conventional kind, with higher accuracy and less image pre-processing.
Think about the famous MNIST handwritten digit recognition problem. A logistic regression (softmax) model will probably reach an accuracy of around 93%. A feed-forward neural network boosts the accuracy to about 98%. A convolutional neural network, however, can easily push it above 99%.
Conventional OCR extracts characteristics from each isolated shape and then assigns a symbol. During feature extraction, the bitmap of each symbol is broken up into a set of characteristics such as lines, strokes, curves, and loops. Rules are then applied to find the closest symbol. The attached image is an example of the detailed terminology available to describe the "geography" of a letterform.
One big benefit of a convolutional neural network is automated feature extraction, which works very well in image-related recognition and classification.
Before the actual character recognition, however, comes a very challenging step called character segmentation: separating the individual letters of a word. The next snapshot shows what I mean. Some letters are touching and some are even degraded. It is not easy to segment the individual letters, and recognizing the degraded ones is close to impossible.
Character segmentation can be avoided, however, if the OCR engine performs word recognition with an artificial neural network. After all, separating the words of a text line is much easier than separating the individual letters of a word. But why word recognition rather than character recognition? Because of the particular advantages of the CRNN architecture described in the paper. The CNN+LSTM architecture is designed specifically for recognizing sequence-like objects in images. It can learn words directly, without detailed character-level annotation or segmentation.
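To illustrate how a word can be read out without segmenting its letters: CRNN-style models are typically paired with CTC (the transcription layer in the referenced paper), which emits one label per image column and then merges the result. The following is a minimal sketch of greedy (best-path) CTC decoding in plain Python. The blank index `0`, the function name `ctc_greedy_decode`, and the toy alphabet are assumptions made for illustration and are not taken from this repo's code.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC "blank" label


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Turn per-frame class scores into a string with best-path CTC decoding.

    frame_scores: one list of class scores per time step (image column).
    alphabet: characters for labels 1..len(alphabet); label 0 is the blank.
    """
    # 1. Pick the highest-scoring label at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Collapse consecutive repeats, so "ll" only survives if a blank
    #    separates the two runs.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining labels to characters.
    return "".join(alphabet[label - 1] for label in collapsed if label != blank)


# Toy example: one-hot scores for the label path h h _ e l l _ l o o
path = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
scores = [[1.0 if j == i else 0.0 for j in range(5)] for i in path]
print(ctc_greedy_decode(scores, "helo"))  # hello
```

No frame is ever assigned to a hand-drawn character boundary; the blank label is what lets the network separate repeated letters on its own.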
My philosophy on machine learning and artificial intelligence is that if you want the machine to predict data more accurately, you had better let it “see” that data. This sounds a little like “cheating”, but it is the truth. In machine learning, it is very common for a new model to work well right after deployment and then get worse and worse as time goes on. There is nothing wrong with the model. The problem is the data: the new data are no longer similar to the training data pool. Back to text recognition: I first developed a word recognition model trained on millions of synthetic word images. It achieves >99% accuracy and works well on regular text images (book pages, newspapers, etc.). When I applied the model to business documents, its performance dropped. Why? Because the synthetic training images were obtained from regular, clean text images.
Text recognition is relatively static. You will not see big changes in text styles. Training data, whether synthetic or real text images, are cheap and accessible, so developing a new OCR model does not take long. This is why I am conducting this research: developing a customized OCR for certain business documents. My goal is to achieve a recognition rate on those documents that is comparable to, or even higher than, that of AWS Textract. Word recognition is the first step in this research. The next step is document layout analysis, covering font style, font size, lines, cells, boxes, tables, and blocks. All of these could be very indicative and discriminative features for building robust downstream models.
Mind you, machine learning is not about the machine’s “intelligence”; it is all about automation.
- https://arxiv.org/abs/1507.05717
- https://github.com/bgshih/crnn
- https://github.com/sbillburg/CRNN-with-STN/blob/master/Batch_Generator.py
- https://github.com/weinman/cnn_lstm_ctc_ocr
- https://www.tsgrp.com/2019/02/12/amazon-textract-and-opentext-capture-recognition-engine-recostar-comparison
- https://www.how-ocr-works.com/intro/intro.html
A pre-trained model (two files) is saved on Google Drive. Please put both files in './best_model'.
A sample of synthetic word images (50K) is included for experimentation only. It is not sufficient to achieve a high recognition rate in a real application. The pre-trained model was trained on millions of synthetic and real word images. If you would like more samples for your research, please contact me ([email protected]).
Given a text image and its Textract OCR response, you can quickly check the model performance. There is still room for improvement, though, in image pre-processing (removing lines, increasing the contrast, de-noising, etc.) and in spell check.
$ python ./test/compare_textract.py --image ./test/test_1/test_1.png --response ./test/test_1/apiResponse.json --output ./test/test_1
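For readers who want to run a similar check by hand, here is a rough sketch of how a saved Textract response could be compared with another recognizer's output. It relies only on the documented Textract response layout (a `Blocks` list whose `WORD` entries carry `Text`, `Confidence`, and `Geometry.BoundingBox`). The helper names and the assumption that both word lists are already aligned one-to-one are mine, and this is not necessarily how `compare_textract.py` works.

```python
import json


def load_response(path):
    """Load a Textract API response that was saved as JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def textract_words(response):
    """Return (text, confidence, bounding_box) for every WORD block."""
    words = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "WORD":
            box = block.get("Geometry", {}).get("BoundingBox", {})
            words.append((block.get("Text", ""), block.get("Confidence"), box))
    return words


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predicted, reference):
    """Total edit distance divided by total reference length.

    Assumes predicted[i] and reference[i] describe the same word image.
    """
    total = sum(len(word) for word in reference)
    errors = sum(edit_distance(p, r) for p, r in zip(predicted, reference))
    return errors / total if total else 0.0
```

A typical use would be `reference = [text for text, _, _ in textract_words(load_response(path))]`, followed by `character_error_rate(model_words, reference)`, where `model_words` is a hypothetical list of predictions for the same word crops.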
A simple version of document layout analysis has been added. The current version is used to detect the word images in a document. Further investigation and research are still needed. That work will live in a new repository, where a better document layout extraction will be developed to extract more useful information than just texts and coordinates. If you are interested, please join.
$ python serve.py --image ./test/test_1/test_1.png --output ./test/test_1
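As a small illustration of what layout analysis can build on top of detected word boxes, the sketch below groups boxes into text lines by vertical position. It is only an assumption about one possible next step, not a description of what `serve.py` does; the `(left, top, width, height)` box format and the function name `group_into_lines` are hypothetical.

```python
def group_into_lines(boxes):
    """Group word boxes into text lines, each sorted left to right.

    boxes: iterable of (left, top, width, height) tuples in pixel units.
    A box joins the current line when its vertical center falls inside
    the vertical extent of the box most recently added to that line.
    """
    lines = []
    # Visit boxes from top to bottom, ordered by vertical center.
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        center = box[1] + box[3] / 2
        if lines:
            _, top, _, height = lines[-1][-1]
            if top <= center <= top + height:
                lines[-1].append(box)
                continue
        lines.append([box])
    # Within each line, order the words by their left edge.
    return [sorted(line, key=lambda b: b[0]) for line in lines]


boxes = [(50, 10, 30, 12), (10, 11, 30, 12), (10, 40, 30, 12), (50, 42, 30, 12)]
for line in group_into_lines(boxes):
    print(line)
```

This simple rule breaks down on skewed scans and multi-column pages, which is part of why a more careful layout extraction is still needed.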
Spell check is commonly used for OCR post-processing. However, it is easily overused. It works well for text-only recognition tasks; in other tasks, the spell check should be adapted to the specific use case. For example, for ML/AI applications we can restrict the spell check to key words. Efforts should therefore be focused primarily on improving the CRNN accuracy.
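The key-word idea can be sketched with Python's standard `difflib` module: only words that closely resemble an entry in a small, task-specific vocabulary are corrected, and everything else (IDs, amounts, names) is left untouched. The function name `correct_keywords`, the example vocabulary, and the `0.8` similarity cutoff are illustrative assumptions that would need tuning for real documents.

```python
import difflib


def correct_keywords(words, keywords, cutoff=0.8):
    """Snap OCR words onto a keyword vocabulary, leaving other words as-is.

    words: recognized words, in reading order.
    keywords: the only words that spell check is allowed to produce.
    cutoff: minimum difflib similarity (0..1) required for a correction.
    """
    # Compare case-insensitively, but return the keyword's canonical casing.
    lookup = {keyword.lower(): keyword for keyword in keywords}
    candidates = list(lookup)
    corrected = []
    for word in words:
        match = difflib.get_close_matches(
            word.lower(), candidates, n=1, cutoff=cutoff
        )
        corrected.append(lookup[match[0]] if match else word)
    return corrected


print(correct_keywords(["Invoce", "A1B2C3", "Totl", "Amount"], ["Invoice", "Total"]))
# ['Invoice', 'A1B2C3', 'Total', 'Amount']
```

Because the vocabulary is closed, a code such as `A1B2C3` is never "fixed" into a dictionary word, which is the kind of overuse that a general-purpose spell checker invites.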