DeepErase

DeepErase is a U-net-like TensorFlow semantic segmentation model that removes artifacts (lines, boxes, spurious words) from text images extracted from documents. Cleaning away these artifacts improves OCR performance on the extracted images.

Authors

Yike Qi, Ronny Huang

Abstract

We present a method to programmatically generate artificial text images with realistic-looking artifacts, and use them to train the U-net-like model in a totally unsupervised manner.
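To illustrate why synthetic generation needs no manual labels, here is a minimal sketch that overlays random line artifacts on a clean image and returns the artifact mask alongside it. It is not the generator used in this repository: the function name `add_line_artifacts`, its parameters, and the 0/255 ink convention are all assumptions made for the example, and a plain block of ink stands in for a rendered word.

```python
import numpy as np


def add_line_artifacts(clean, n_lines=3, max_thickness=3, seed=None):
    """Overlay random horizontal/vertical lines on a clean text image.

    clean: 2-D uint8 array, 0 = background, 255 = ink (assumed convention).
    Returns (corrupted, mask), where mask is 1 on every pixel covered by
    an artifact line, including pixels that overlap the text.
    """
    rng = np.random.default_rng(seed)
    height, width = clean.shape
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(n_lines):
        thickness = int(rng.integers(1, max_thickness + 1))
        if rng.random() < 0.5:
            # Horizontal line, e.g. a form underline or a box edge.
            top = int(rng.integers(0, height - thickness + 1))
            mask[top:top + thickness, :] = 1
        else:
            # Vertical line, e.g. the side of a box.
            left = int(rng.integers(0, width - thickness + 1))
            mask[:, left:left + thickness] = 1
    # Artifact pixels become ink; all other pixels keep their clean value.
    corrupted = np.where(mask == 1, 255, clean).astype(np.uint8)
    return corrupted, mask


# Stand-in for a rendered word: a blank canvas with a block of "ink".
clean = np.zeros((32, 128), dtype=np.uint8)
clean[12:20, 20:100] = 255
corrupted, mask = add_line_artifacts(clean, seed=0)
```

Because the artifacts are drawn by the program, `mask` is available as a pixel-level training target for free, which is what lets the segmentation model train without hand annotation.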
The U-net-like model was trained in two modes:
- Standalone training: optimize the U-net segmentation loss only.
- Joint training with a downstream recognition model: optimize the sum of the U-net segmentation loss and the recognition CTC loss, to balance image cleaning against recognition performance. An RNN-CTC-based HTR (handwritten text recognition) model was used as the recognition model.
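The two objectives above can be sketched as follows. This is an illustration in NumPy rather than the repository's TensorFlow code: it assumes the segmentation loss is a mean per-pixel sigmoid cross-entropy, treats the CTC loss as an already computed scalar, and adds a hypothetical `ctc_weight` parameter (1.0 reproduces the plain sum described above).

```python
import numpy as np


def segmentation_loss(logits, mask):
    """Mean per-pixel sigmoid cross-entropy between logits and a 0/1 mask."""
    logits = np.asarray(logits, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    # Numerically stable form: max(x, 0) - x*z + log(1 + exp(-|x|)).
    per_pixel = (np.maximum(logits, 0) - logits * mask
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(per_pixel.mean())


def training_loss(seg_loss, ctc_loss=None, ctc_weight=1.0):
    """Standalone mode when ctc_loss is None, joint mode otherwise."""
    if ctc_loss is None:
        return seg_loss
    # ctc_weight is a hypothetical knob; 1.0 gives the unweighted sum.
    return seg_loss + ctc_weight * ctc_loss


logits = np.zeros((4, 4))
mask = np.zeros((4, 4))
seg = segmentation_loss(logits, mask)      # log(2) for all-zero logits
standalone = training_loss(seg)
joint = training_loss(seg, ctc_loss=2.5)   # 2.5 is a made-up CTC value
```

In joint mode the gradient of the recognition loss flows back into the cleaning network, so the U-net is pushed toward outputs the recognizer reads well, not only toward outputs that match the artifact mask.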

Result

In both training modes, the validation pixel-level segmentation accuracy was above 95%.
Downstream recognition performance was evaluated on validation images and on IRS extractions. The IRS extractions were taken from NIST sd02 tax forms and were not used in model training. Word recognition accuracy improved and beat the naive cv2 Hough-based cleaning method.
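For readers who want to reproduce the two kinds of numbers reported above, the sketch below shows one common way to compute them. The definitions are assumptions, not taken from the evaluation code: pixel accuracy is the fraction of pixels whose predicted label matches the ground truth, and word accuracy is the fraction of words recognized exactly. The inputs are toy values, not results.

```python
import numpy as np


def pixel_accuracy(pred_mask, true_mask):
    """Fraction of pixels where the predicted label equals the true label."""
    pred = np.asarray(pred_mask)
    true = np.asarray(true_mask)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")
    return float(np.mean(pred == true))


def word_accuracy(predicted, reference):
    """Fraction of words whose recognized string matches exactly."""
    if len(predicted) != len(reference):
        raise ValueError("word lists must have the same length")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Toy inputs for illustration only.
pred = np.array([[0, 1], [1, 1]])
true = np.array([[0, 1], [0, 1]])
acc = pixel_accuracy(pred, true)   # 3 of 4 pixels agree
words = word_accuracy(["form", "1040", "tax"], ["form", "1040", "tox"])
```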

Requirements

Python 3.5 or above
tensorflow 1.12.0
torch 0.4.1
cv2 4.0.0

or simply pull one of the prebuilt Docker images:

docker pull wrhuang/default

or

docker pull jdegange/default

A few additional packages may then need to be installed with pip.