AI Documentation
AI Module
First Goal: Text Detection
As text detection is a task of high difficulty, we decided to make use of an existing convolutional-recurrent neural network model. We settled on Qi Guo and Yuntian Deng's Attention-OCR model, as implemented by emedvedev (available at https://github.com/emedvedev/attention-ocr).
Model Architecture
The network first runs multiple convolutional layers over the original image in a sliding-window approach. Each feature map is then passed through a feed-forward layer to produce feature encodings. The encoding vectors are finally fed to the LSTM-based decoder cell in the final layer. The network is dynamically sized, based on the desired maximum input dimensions. Initially we used the AOCR network to analyse an entire photo sent by a mobile device.
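The flow above can be sketched in numpy. This is a toy illustration, not the actual Attention-OCR code: the window size, stride, and feature dimension are made-up values, and a random projection stands in for the trained convolutional and feed-forward layers. The point is only the shape of the data: each horizontal window of the image becomes one encoding vector, and the sequence of encodings is what the LSTM decoder consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: a 32x256 grayscale text-line image (made-up sizes).
image = rng.standard_normal((32, 256))

def sliding_window_features(img, win_w=8, stride=8, feat_dim=64):
    """Stand-in for the conv + feed-forward feature extractor: each
    horizontal window is flattened and projected to one encoding vector."""
    h, w = img.shape
    proj = rng.standard_normal((h * win_w, feat_dim)) * 0.01  # fake learned weights
    feats = [img[:, x:x + win_w].reshape(-1) @ proj
             for x in range(0, w - win_w + 1, stride)]
    return np.stack(feats)  # (num_windows, feat_dim): the decoder's input sequence

encodings = sliding_window_features(image)
print(encodings.shape)  # (32, 64)
```

Because the sequence length depends on the image width, a wider input simply yields more encoding vectors, which is what makes the network dynamically sized.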
Datasets
1. ICDAR 2013
2. ICDAR 2015
3. IIIT5K
4. Synth90k
5. Personal synthetic dataset
6. Book-cover crawler for allitebooks.com (this dataset was eliminated, as perfectly taken images would damage our network's ability to generalise; we observed that accuracy decreased by 8% after training on it)
We faced two issues:
- Inappropriate network architecture. By forcing the network to run on images containing small boxes of text, the network became greatly imbalanced towards its input.
- Heavy overfitting caused by the lack of training data. OCR networks should be fed far more data than we were able to provide. The solution we used to bypass this issue was dividing the problem into Text Localisation and Text Recognition.
Text Localisation
To help our OCR model, we decided to train another neural network to detect text boxes in our images. We then cropped the proposed text boxes and fed them to the recognition network. After reading multiple papers and analysing the available resources, we decided to train an EAST model.
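The crop-and-forward step can be sketched as follows. This is a minimal numpy illustration only: the axis-aligned `(x, y, w, h)` box format and the example coordinates are assumptions for simplicity (EAST can actually emit rotated geometries, which would need a rectification step before cropping).

```python
import numpy as np

def crop_boxes(image, boxes):
    """Crop each proposed (x, y, w, h) text box out of the image so the
    crops can be fed to the recognition network one at a time."""
    return [image[y:y + h, x:x + w] for x, y, w, h in boxes]

# Toy 100x200 "photo" and two hypothetical detector proposals.
img = np.zeros((100, 200), dtype=np.uint8)
boxes = [(10, 20, 50, 16), (80, 60, 40, 12)]
crops = crop_boxes(img, boxes)
print([c.shape for c in crops])  # [(16, 50), (12, 40)]
```

Splitting the problem this way means the recognition network only ever sees tight text crops, which addresses the input-imbalance issue described above.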
Model Architecture
Comparing Architectures (EAST: An Efficient and Accurate Scene Text Detector)
Training Results
Early Training Results (2000 steps): Train Set Loss: 0.200, Test Set Accuracy: 0.67
Late Training Results (9000 steps): Train Set Loss: 0.0022, Test Set Loss: 0.045
Text Recognition
To compensate for our lack of data, we generated our own synthetic dataset. As the Attention-OCR model was biased towards English words, we crawled a Romanian dictionary. Then, using image manipulation libraries, we randomly placed text on a plethora of textures while adding Gaussian noise, salt-and-pepper noise, Gaussian blur, blooming filters, etc. The text was placed at random positions and angles so the network would learn to handle multiple text orientations.
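Two of the noise augmentations can be sketched in numpy. This is a simplified illustration: the noise parameters are made-up values, the texture is a random array standing in for a real background, and the actual pipeline rendered dictionary words onto textures with image manipulation libraries before applying these filters.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(img, sigma=10.0):
    """Additive Gaussian noise, clipped back to the valid uint8 range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_and_pepper(img, amount=0.02):
    """Set a random fraction of pixels to pure black or pure white."""
    noisy = img.copy()
    mask = rng.random(img.shape)
    noisy[mask < amount / 2] = 0        # pepper
    noisy[mask > 1 - amount / 2] = 255  # salt
    return noisy

texture = rng.integers(0, 256, (64, 64), dtype=np.uint8)  # stand-in background
augmented = add_salt_and_pepper(add_gaussian_noise(texture))
print(augmented.shape, augmented.dtype)
```

Chaining several such corruptions per sample multiplies the effective dataset size, which is the point of the synthetic generator.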
Datasets:
1. ICDAR 2013 / 2015
2. SVT dataset (Google Street View challenge; converted into the appropriate format)
3. ISI-PPT dataset
Training Results
Early-Training Results:
- Train Set Accuracy: 95%
- Test Set Accuracy: 39.95%

Late-Training Results:
- Train Set Accuracy: 99%
- Test Set Accuracy: 47.06%
Clearly, strong overfitting occurred. Possible solutions include L2 regularization and adding dropout to the layers, but above all the dataset should be larger.
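Both proposed fixes can be sketched in numpy. This is illustrative only: the penalty coefficient and dropout rate are made-up values, and in practice these would be the framework's built-in regularizer and dropout layer rather than hand-rolled code.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-4):
    """L2 regularization term added to the loss: lam * sum of squared weights."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero random units during training and rescale the
    survivors so the expected activation is unchanged at test time."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

a = np.ones((4, 8))
dropped = dropout(a, rate=0.5)
# Dropped units are exactly 0; kept units are rescaled to 2.0.
print(sorted(set(dropped.ravel())))
print(l2_penalty([np.ones((2, 2))], lam=0.1))  # 0.1 * 4 = 0.4
```

The L2 term discourages large weights that memorise training samples, while dropout forces the network not to rely on any single unit; neither substitutes for more data, as noted above.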
Second Goal: Recommender System
We decided to implement a collaborative-filtering neural network. We made use of two embedding layers: one for our users and one for our books. The embedding layers should be able to extract the important features and information from the users' ratings alone. As the dataset was large, we did not face any overfitting issues. For our network we used the Adam optimiser (which combines momentum with adaptive per-parameter learning rates).
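The embedding idea can be sketched in numpy as a toy matrix-factorization model. This uses plain SGD rather than Adam for brevity, and the embedding dimension, learning rate, and sizes are made-up values; the real model is a neural network with learned embedding layers, but the core signal is the same: a rating is predicted from the interaction of a user vector and a book vector.

```python
import numpy as np

rng = np.random.default_rng(1)

n_users, n_books, dim = 100, 500, 8
user_emb = rng.standard_normal((n_users, dim)) * 0.1  # one vector per user
book_emb = rng.standard_normal((n_books, dim)) * 0.1  # one vector per book

def predict(u, b):
    """Predicted rating = dot product of the two embedding vectors."""
    return user_emb[u] @ book_emb[b]

def sgd_step(u, b, rating, lr=0.05):
    """One SGD update on the squared error of a single (user, book, rating)."""
    err = predict(u, b) - rating
    user_emb[u], book_emb[b] = (user_emb[u] - lr * err * book_emb[b],
                                book_emb[b] - lr * err * user_emb[u])
    return err ** 2

# Repeated updates on one toy rating drive the squared error down.
losses = [sgd_step(3, 42, rating=4.0) for _ in range(200)]
print(round(losses[0], 3), round(losses[-1], 3))
```

Only (user, book, rating) triples are used for training, so the embeddings must encode taste and content purely from rating co-occurrence, as described above.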
Results
Train Set Loss: 0.62, Test Set Accuracy: 0.67
References
1. EAST: An Efficient and Accurate Scene Text Detector. Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, Jiajun Liang. https://arxiv.org/abs/1704.03155
2. TextBoxes++: A Single-Shot Oriented Scene Text Detector. Minghui Liao, Baoguang Shi, Xiang Bai. https://arxiv.org/abs/1801.02765
3. Attention-based Extraction of Structured Information from Street View Imagery. Zbigniew Wojna, Alex Gorban, Dar-Shyang Lee, Kevin Murphy, Qian Yu, Yeqing Li, Julian Ibarz.