ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction
This code adapts the SSD PyTorch tutorial to our specific task. Thanks to Sagar Vinodababu for his permission and support.
Download the ICDAR-SROIE dataset:
- 2019 ICDAR-SROIE (542MB)
- Task 1 - Scanned Receipt Text Localisation
We use SSD300 as our backbone. Since the OCR task has only one label class [text], we classify our boxes into two classes [background, text]. We split the original training set into a training set and a testing set.
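For reference, this two-class setup boils down to a small label map; a minimal sketch is below (the variable names are illustrative, and the actual mapping is the one written to `label_map.json` by `create_data_lists.py`):

```python
# Illustrative two-class label map for this task (names are assumptions;
# the real mapping is stored in label_map.json by create_data_lists.py).
label_map = {'background': 0, 'text': 1}
rev_label_map = {v: k for k, v in label_map.items()}  # index -> class name
n_classes = len(label_map)                            # 2
```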
Run [`boxing.py`] to see the goal of this task.
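One plausible way such a visualization could be done, assuming the standard SROIE annotation format (eight quadrilateral coordinates followed by the transcript on each line), is sketched below; `boxing.py` itself may differ:

```python
# Hypothetical visualization sketch: draw the ground-truth boxes from an SROIE
# '.txt' annotation on top of the receipt image (boxing.py may do this differently).
from PIL import Image, ImageDraw

def draw_ground_truth(image_path, annotation_path, out_path='boxed.jpg'):
    image = Image.open(image_path).convert('RGB')
    draw = ImageDraw.Draw(image)
    with open(annotation_path, encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) < 9:
                continue                                # skip malformed lines
            coords = list(map(int, parts[:8]))          # x1,y1,...,x4,y4 of the quadrilateral
            draw.polygon(coords, outline=(255, 0, 0))
    image.save(out_path)
```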
There are several tricks for applying SSD to OCR:
1. Tackle the scale variation with ROI finding.
Since the scanned receipts have varied resolutions and SSD requires a fixed input size of 300×300, we find the ROI of each receipt so the model can focus on the content.
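A rough sketch of this idea, cropping the (mostly white) receipt to its dark-content bounding box before resizing, is shown below; the threshold and margin values are assumptions, not the repo's exact implementation:

```python
# Hedged ROI sketch: crop the scanned receipt to the region that actually contains
# ink before resizing to 300x300, so the text is shrunk as little as possible.
import numpy as np
from PIL import Image

def crop_to_roi(image, threshold=200, margin=10):
    gray = np.array(image.convert('L'))
    ys, xs = np.where(gray < threshold)                 # pixels darker than the white background
    if len(xs) == 0:
        return image                                    # blank page: keep as-is
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, gray.shape[1])
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, gray.shape[0])
    return image.crop((x0, y0, x1, y1))

# roi = crop_to_roi(Image.open('receipt.jpg')).resize((300, 300))
```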
2. Improve the Non-Maximum Suppression.
With only the traditional NMS (NMS 1), if we want an accurate but clean result, we have only two hyperparameters [min_score, max_overlap] to tune. If we set min_score higher, many correct boxes are suppressed, because the confidence of correct classifications is not that high, especially on testing images. If we set min_score lower, lots of empty boxes come in, and they are stubborn. As for max_overlap, we tend to lower it for clearer segmentation. However, this conflicts with our goal of retaining longer boxes instead of shorter ones: a box covering only part of a word group may have higher confidence and suppress a longer box with lower confidence.
To address this, we add two more NMS passes based on two statistical observations: for a given word group, longer boxes containing more words have a higher sum of pixel values than shorter boxes containing fewer words; and compared with boxes that contain content, empty boxes have a higher average pixel value, since the background is white.
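A hedged sketch of these two statistics is given below; the actual NMS code may combine them differently, and the thresholds are assumptions:

```python
# Two pixel statistics used to post-filter/rank boxes (thresholds are illustrative):
# - sum of pixel values inside a box: for the same word group, a longer box covering
#   more words has a higher sum than a shorter box covering fewer words;
# - mean pixel value inside a box: a nearly white mean indicates an empty box.
import numpy as np

def box_pixel_stats(gray, box):
    """gray: HxW uint8 receipt image; box: (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = [int(round(c)) for c in box]
    patch = gray[y_min:y_max, x_min:x_max].astype(np.float64)
    if patch.size == 0:
        return 0.0, 255.0                               # degenerate box: treat as empty
    return patch.sum(), patch.mean()

def drop_empty_boxes(gray, boxes, max_mean=245.0):
    """Discard boxes whose content is almost pure white (the 'stubborn' empty boxes)."""
    return [box for box in boxes if box_pixel_stats(gray, box)[1] < max_mean]
```

When two overlapping boxes compete for the same word group, the pixel-value sum can then be considered alongside the classification confidence so that the longer box is retained.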
- `split_train_test.py`: Split the raw images and '.txt' annotations into a training set and a testing set.
- `create_data_lists.py` -> `utils.create_data_lists()` -> `utils.parse_annotation()`: Create '.json' files storing the image IDs, the objects {boxes & labels & texts}, and the labels (number of classes).
- `train.py`: Define the main training procedure.
- `train.py` -> `datasets.ICDARDataset()`: Read the '.json' files, apply data preprocessing with `utils.transform()`, and store everything in a Dataset class to be used by a PyTorch DataLoader to create batches (see the sketch after this list).
- `train.py` -> `model.py`: Define the SSD model and its MultiBox loss function.
- `detect.py`: Define the main testing procedure (single arbitrary image), including data preprocessing and annotation.
- `detect.py` -> `model.py`: Define the SSD model and the `detect_objects()` function with non-maximum suppression.
- `eval.py`: Define the main evaluation procedure (using the predefined testing DataLoader).
- `eval.py` -> `model.py`: Define the SSD model and the `detect_objects()` function with non-maximum suppression.
- `eval.py` -> `utils.calc_f1()`: Calculate the F1 score.
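To illustrate how these pieces fit together, a skeleton of the dataset described above might look like the following (class, field, and parameter names are assumptions; `datasets.py` holds the real implementation):

```python
# Illustrative skeleton of the ICDAR dataset: read the JSON lists produced by
# create_data_lists.py and hand transformed samples to a PyTorch DataLoader.
import json
import os
from PIL import Image
from torch.utils.data import Dataset

class ICDARDatasetSketch(Dataset):
    def __init__(self, data_folder, split, transform=None):
        self.split = split.upper()                        # 'TRAIN' or 'TEST'
        with open(os.path.join(data_folder, self.split + '_images.json')) as f:
            self.images = json.load(f)                    # list of image paths
        with open(os.path.join(data_folder, self.split + '_objects.json')) as f:
            self.objects = json.load(f)                   # boxes / labels / texts per image
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        image = Image.open(self.images[i]).convert('RGB')
        boxes = self.objects[i]['boxes']                  # [x_min, y_min, x_max, y_max] per box
        labels = self.objects[i]['labels']                # all 1 ('text') in this task
        if self.transform is not None:
            image, boxes, labels = self.transform(image, boxes, labels)
        return image, boxes, labels
```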
Open [`split_train_test.py`] and set `train_test_ratio`; in this case it is set to 4. In `.\ICDAR_Dataset`, two folders, `train1` and `test1`, are created, and the images and labels are split into them in sequence.
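One plausible reading of a sequential 4:1 split is sketched below (`split_train_test.py` may organize the copying differently):

```python
# Hedged sketch of a sequential 4:1 split: the first 4/5 of the (image, '.txt') pairs
# go to train1 and the remaining 1/5 go to test1.
train_test_ratio = 4

def split_in_sequence(pairs):
    """pairs: list of (image_path, txt_path) tuples in their original order."""
    n_train = len(pairs) * train_test_ratio // (train_test_ratio + 1)
    return pairs[:n_train], pairs[n_train:]
```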
Run [`create_data_lists.py`] to pack all the image paths, objects, and labels into JSON files for further operations. These JSON files are named `TEST_images.json`, `TEST_objects.json`, `label_map.json`, `TRAIN_images.json`, and `TRAIN_objects.json`.
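The sketch below illustrates what these files plausibly contain (field names follow the overview above; the exact schema may differ):

```python
# Illustrative content of the generated JSON files (not the exact schema):
# TRAIN_images.json / TEST_images.json  : a list of image paths, one per receipt.
# TRAIN_objects.json / TEST_objects.json: one entry per image with its boxes, labels and texts.
# label_map.json                        : the class-name-to-index mapping.
example_objects_entry = {
    'boxes': [[72, 25, 326, 64], [72, 70, 330, 105]],   # [x_min, y_min, x_max, y_max]
    'labels': [1, 1],                                    # 1 == 'text'
    'texts': ['<transcript of line 1>', '<transcript of line 2>'],
}
```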
Run [`train.py`] to train an end-to-end text detection model. In this case we use Adam as the optimizer.
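A compressed sketch of one training step is shown below (see `train.py` for the full procedure, checkpointing, and hyperparameters); the constructor arguments, learning rate, and the `priors_cxcy` attribute are assumptions:

```python
# Hedged training-step sketch: SSD300 + MultiBox loss (from model.py) optimized with Adam.
import torch
from model import SSD300, MultiBoxLoss    # classes described in the overview above

model = SSD300(n_classes=2)                                # [background, text]
criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy)    # constructor argument is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is an assumption

def train_one_batch(images, boxes, labels):
    predicted_locs, predicted_scores = model(images)       # SSD forward pass
    loss = criterion(predicted_locs, predicted_scores, boxes, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```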
Run [`detect.py`] to detect a test image with the pretrained model that achieved the minimum validation loss.
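Loading the best checkpoint and detecting a single image could look roughly like this (the checkpoint file name and layout, the preprocessing values, and the `detect_objects()` thresholds are assumptions):

```python
# Hedged single-image detection sketch: load the best checkpoint, preprocess to 300x300,
# run the SSD forward pass, then decode boxes with detect_objects() (NMS inside).
import torch
import torchvision.transforms.functional as FT
from PIL import Image

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load('BEST_checkpoint.pth.tar', map_location=device)  # name is an assumption
model = checkpoint['model'].to(device).eval()

image = Image.open('test_receipt.jpg').convert('RGB')
x = FT.normalize(FT.to_tensor(FT.resize(image, (300, 300))),
                 mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

with torch.no_grad():
    locs, scores = model(x.unsqueeze(0).to(device))
    boxes, labels, confs = model.detect_objects(locs, scores,
                                                min_score=0.3, max_overlap=0.3, top_k=200)
```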
Run [`eval.py`] to evaluate the performance of the model based on F1 scores.
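For completeness, here is a minimal F1 computation in the spirit of `utils.calc_f1()` (the repo's box-matching rule, e.g. its IoU threshold, is not shown here):

```python
# Minimal F1 sketch: precision and recall over matched boxes, then their harmonic mean.
def f1_score(n_correct, n_predicted, n_ground_truth):
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_ground_truth if n_ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```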