New features include:
- Real-Time Video Text Recognition
- ... and more features in the future!
Our task is to train the Scene Text Recognition (STR) model so that it can recognise numbers with decimal points (one of the special characters), i.e. floating point numbers. However, there is no proper dataset available online with images of floating point numbers. As a result, we need to generate our own floating point number dataset and train the STR model to recognise such numbers with a high confidence score.
We used three types of datasets for training and validation:
- Synthetic floats generated using OpenCV2
- Synthetic floats generated using TextRecognitionDataGenerator (special thanks to the original author)
- Digits from the Street View House Numbers (SVHN) 32 x 32 dataset (stay tuned for instructions)
Our training and validation datasets consist only of integers and floating point numbers. There is no training on alphabetic characters! Therefore, the pretrained models provided below can only be used for inferring integers and floats.
By using `rand_crop_simple.py` in the root folder, we can generate random synthetic floats that vary in:
- Fonts
- Thickness of fonts
- Color of fonts
- Scale of fonts
- Noise
- Blur
The generated floats are superimposed on a cropped image (the base image is to be supplied by you).
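For orientation, here is a minimal sketch of the kind of generation `rand_crop_simple.py` performs; the variable names, parameter ranges, and the assumption that the base image is larger than the crop are illustrative, not the script's actual code:

```python
import random

import cv2
import numpy as np

# Illustrative sketch only: superimpose a random float onto a random crop of a base image.
base = cv2.imread('landscape.jpg')                 # default base image; assumed larger than the crop
h, w = base.shape[:2]
x, y = random.randint(0, w - 200), random.randint(0, h - 64)
crop = base[y:y + 64, x:x + 200].copy()            # random 200x64 crop used as the background

decimals = random.randint(0, 3)                    # corresponds to the -d range described below
formatted_value = f'{random.uniform(0, 1000):.{decimals}f}'

font = random.choice([cv2.FONT_HERSHEY_SIMPLEX, cv2.FONT_HERSHEY_DUPLEX])
scale = random.uniform(0.8, 1.5)
thickness = random.randint(1, 3)
color = tuple(int(c) for c in np.random.randint(0, 256, 3))

numbered = cv2.putText(crop, formatted_value, (10, 45), font, scale, color, thickness)
numbered = cv2.GaussianBlur(numbered, (3, 3), 0)   # mild blur
noise = np.random.normal(0, 5, numbered.shape).astype(np.int16)
numbered = np.clip(numbered.astype(np.int16) + noise, 0, 255).astype(np.uint8)

cv2.imwrite('train0.jpg', numbered)                # filename pattern as in the steps below
```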
How to create Synthetic Floats using OpenCV2:
- Create the following folder structure:

  ```
  deep-text-recognition-benchmark/
      custom_datasets/
          converted_custom_data/
              training/
              validation/
      rand_crop_simple.py
  ```
- Open `rand_crop_simple.py`.
- Refer to the lines after `if __name__ == '__main__':`.
- Add a base image. Ignore this step if you are using the default `landscape.jpg`.

  ```python
  image = cv2.imread('<path to your image>')
  ```
- To create a ground truth .txt file when generating the images, set the ground truth text file name at the line:

  ```python
  f = open('<ground truth text file>', 'w+')
  ```

  Example:

  ```python
  f = open('gt_training.txt', 'w+')
  ```

  Also, name the generated images at the line:

  ```python
  cv2.imwrite('<name of image>' + str(i) + '.jpg', numbered)
  ```

  Example:

  ```python
  cv2.imwrite('train' + str(i) + '.jpg', numbered)
  ```

  Finally, ensure that the ground truth text file records the correct directory to read the images from, along with their respective labels. Edit the line:

  ```python
  f.write('<path to images>' + str(i) + '.jpg\t' + str(formatted_value) + '\n')
  ```

  Example:

  ```python
  f.write('converted_custom_data/training/train' + str(i) + '.jpg\t' + str(formatted_value) + '\n')
  ```

  (Sample lines of the resulting file are shown after this list.)
- Run `rand_crop_simple.py` with the following arguments:
  - Number of images you want to generate: `-n <no. of images you want to generate>`, e.g. `-n 10000`.
    Note: by default, the images will be generated in the same directory as `rand_crop_simple.py`.
  - Range of decimal places allowed for your floats: `-d <lower decimal limit> <upper decimal limit>`, e.g. `-d 0 3`.

  Example command:

  ```
  python rand_crop_simple.py -n 10000 -d 0 3
  ```
- After generating, move the images into the respective folders depending on how they are named. For example, if the images are to be used for training, move the images into `deep-text-recognition-benchmark/custom_datasets/training/`.
- Return to `deep-text-recognition-benchmark/custom_datasets/`.
- Convert the images and ground truth text to an lmdb dataset:

  ```
  python create_lmdb_dataset.py --inputPath ./ --gtFile converted_custom_data/<ground truth file you named>.txt --outputPath result/<training or validation>
  ```

  Example:

  ```
  python create_lmdb_dataset.py --inputPath ./ --gtFile converted_custom_data/gt_training.txt --outputPath result/training/
  ```
- DONE! If you face any trouble, please message gordonjun2 or zhuoyang125.
- Repeat the instructions if you want to create another set of data (e.g. a validation dataset).
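For reference, with the example names used in the steps above, `gt_training.txt` ends up containing tab-separated `<image path>\t<label>` lines such as the following (the numeric labels here are placeholders):

```
converted_custom_data/training/train0.jpg	734.21
converted_custom_data/training/train1.jpg	5.8
```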
- If RGB is to be used, run the command:

  ```
  python video_demo.py --saved_model <path of saved model> --Transformation <follow the model> --FeatureExtraction <follow the model> --SequenceModeling <follow the model> --Prediction <follow the model> --character <characters used in model> --file <path to video file> --rgb
  ```

  If monotone is to be used, run the command:

  ```
  python video_demo.py --saved_model <path of saved model> --Transformation <follow the model> --FeatureExtraction <follow the model> --SequenceModeling <follow the model> --Prediction <follow the model> --character <characters used in model> --file <path to video file>
  ```

  Example:

  ```
  python video_demo.py --saved_model saved_models/float_models/TPS-ResNet-BiLSTM-CTC-Seed1111/saved_custom_v6_rgb/best_accuracy.pth --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction CTC --character '.0123456789' --file video_samples/olympics_time.mp4 --rgb
  ```
- Enjoy!
- The pretrained models provided in the original repo (see below) are only valid if you are inferring alphanumeric characters, that is: `0123456789abcdefghijklmnopqrstuvwxyz`
- The pretrained models provided in the original repo were trained in the monotone setting. Do not use `--rgb` in the arguments.
- The pretrained models provided by us (see the next section) are only valid if you are inferring integer or floating point characters, that is: `.0123456789`
- The pretrained models provided by us (see the next section) were trained in both the RGB and monotone settings. See the folder names and use `--rgb` accordingly.
- Strictly speaking, you cannot use a pretrained model from one setting to train or infer images in another setting. See the next pointer if you want to do otherwise.
  For example, you cannot use a pretrained model that was trained using TPS-ResNet-BiLSTM-Attn (non-RGB) to train/infer images using:

  ```
  python train.py --saved_model saved_models/TPS-ResNet-BiLSTM-Attn.pth --rgb --Transformation TPS --FeatureExtraction RCNN --SequenceModeling BiLSTM --Prediction CTC --train_data custom_datasets/result/training_v3 --valid_data custom_datasets/result/validation_v3 --batch_size 8
  ```
- To use a pretrained model that was trained on one setting to train on another setting (e.g. using a TPS-ResNet-BiLSTM-Attn model on a TPS-ResNet-BiLSTM-CTC setting with different characters):

  ```
  python train.py --saved_model saved_models/TPS-ResNet-BiLSTM-Attn.pth --rgb --Transformation TPS --FeatureExtraction RCNN --SequenceModeling BiLSTM --Prediction CTC --train_data custom_datasets/result/training_v3 --valid_data custom_datasets/result/validation_v3 --FT --sensitive --batch_size 8
  ```

  Also, to replace unwanted characters in the full `opt.character` list that `--sensitive` enables, edit `opt.character` in `train.py` and replace the unwanted characters with the ones you need. You can also remove extra unwanted characters by replacing them with spaces (e.g. '?' -> ' '). See the sketch after this list.
- For pretrained models that were trained with alphanumeric characters, see the original GitHub repo.
- For pretrained models that were trained with integers and floats only, download here.
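As a rough illustration of the `opt.character` edit mentioned above, and assuming `train.py` follows the original repo in widening the character set when `--sensitive` is passed, the change could look like this (a sketch, not the exact code):

```python
# In train.py (sketch): assumes `import string` and the parsed `opt` already exist there.
if opt.sensitive:
    # opt.character = string.printable[:-6]        # original: full printable character set
    opt.character = '.0123456789'                  # keep only digits and the decimal point
    # Alternatively, keep the full set but blank out single unwanted characters:
    # opt.character = string.printable[:-6].replace('?', ' ')
```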
Stay tuned for more updates!!
| paper | training and evaluation data | failure cases and cleansed label | pretrained model | Baidu ver(passwd:rryk) |
Official PyTorch implementation of our four-stage STR framework, which most existing STR models fit into.
Using this framework allows us to measure the module-wise contributions to performance in terms of accuracy, speed, and memory demand, under one consistent set of training and evaluation datasets.
Such analyses remove the obstacles in current comparisons and clarify the performance gains of the existing modules.
Based on this framework, we achieved 1st place in ICDAR2013 Focused Scene Text and ICDAR2019 ArT, and 3rd place in ICDAR2017 COCO-Text and ICDAR2019 ReCTS (task 1).
The differences between our paper and the ICDAR challenges are summarized here.
Dec 27, 2019: added FLOPS in our paper, and minor updates such as log_dataset.txt and ICDAR2019-NormalizedED.
Oct 22, 2019: added confidence score, and arranged the output form of training logs.
Jul 31, 2019: The paper is accepted at International Conference on Computer Vision (ICCV), Seoul 2019, as an oral talk.
Jul 25, 2019: added code for floating-point 16 calculation; check @YacobBY's pull request
Jul 16, 2019: added the ST_spe.zip dataset, which contains the word images with special characters from the SynthText (ST) dataset; see this issue
Jun 24, 2019: added gt.txt of failure cases that contains path and label of each image, see image_release_190624.zip
May 17, 2019: uploaded resources in Baidu Netdisk also, added Run demo. (check @sharavsambuu's colab demo also)
May 9, 2019: PyTorch version updated from 1.0.1 to 1.1.0, use torch.nn.CTCLoss instead of torch-baidu-ctc, and various minor updates.
- This work was tested with PyTorch 1.1.0, CUDA 9.0, Python 3.6 and Ubuntu 16.04. You may need `pip3 install torch==1.1.0`
- requirements: lmdb, pillow, torchvision, nltk, natsort
pip3 install lmdb pillow torchvision nltk natsort
Download the lmdb dataset for training and evaluation from here.
data_lmdb_release.zip contains the following:
- training datasets: MJSynth (MJ)[1] and SynthText (ST)[2]
- validation datasets: the union of the training sets of IC13[3], IC15[4], IIIT[5], and SVT[6]
- evaluation datasets: benchmark evaluation datasets, consisting of IIIT[5], SVT[6], IC03[7], IC13[3], IC15[4], SVTP[8], and CUTE[9]
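If you want to peek inside one of these lmdb datasets, the following is a hypothetical inspection snippet; it assumes the CRNN-style key layout ('num-samples', 'image-%09d', 'label-%09d') written by create_lmdb_dataset.py, and the path used here is only an example:

```python
import lmdb  # same dependency used by create_lmdb_dataset.py

# Example path; point this at any lmdb folder inside data_lmdb_release.
env = lmdb.open('data_lmdb_release/training/MJ/MJ_train', readonly=True, lock=False)
with env.begin() as txn:
    n = int(txn.get('num-samples'.encode()))                     # total sample count
    label = txn.get('label-%09d'.encode() % 1).decode('utf-8')   # label of the first sample
    image_bytes = txn.get('image-%09d'.encode() % 1)             # encoded image bytes
    print(n, 'samples; first label:', label, '; image size:', len(image_bytes), 'bytes')
```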
- Download the pretrained model from here
- Add image files to test into `demo_image/`
- Run demo.py (add the `--sensitive` option if you use a case-sensitive model)
CUDA_VISIBLE_DEVICES=0 python3 demo.py \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--image_folder demo_image/ \
--saved_model TPS-ResNet-BiLSTM-Attn.pth
| demo images | TPS-ResNet-BiLSTM-Attn | TPS-ResNet-BiLSTM-Attn (case-sensitive) |
| --- | --- | --- |
| available | Available | |
| shakeshack | SHARESHACK | |
| london | Londen | |
| greenstead | Greenstead | |
| toast | TOAST | |
| merry | MERRY | |
| underground | underground | |
| ronaldo | RONALDO | |
| bally | BALLY | |
| university | UNIVERSITY | |
- Train CRNN[10] model
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
- Test CRNN[10] model
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC \
--saved_model saved_models/None-VGG-BiLSTM-CTC-Seed1111/best_accuracy.pth
- Try to train and test our best accuracy combination (TPS-ResNet-BiLSTM-Attn) also. (download pretrained model)
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--saved_model saved_models/TPS-ResNet-BiLSTM-Attn-Seed1111/best_accuracy.pth
- `--train_data`: folder path to training lmdb dataset.
- `--valid_data`: folder path to validation lmdb dataset.
- `--eval_data`: folder path to evaluation (with test.py) lmdb dataset.
- `--select_data`: select training data. Default is MJ-ST, which means MJ and ST are used as training data.
- `--batch_ratio`: assign the ratio for each selected dataset in the batch. Default is 0.5-0.5, which means 50% of the batch is filled with MJ and the other 50% with ST.
- `--data_filtering_off`: skip data filtering when creating LmdbDataset.
- `--Transformation`: select Transformation module [None | TPS].
- `--FeatureExtraction`: select FeatureExtraction module [VGG | RCNN | ResNet].
- `--SequenceModeling`: select SequenceModeling module [None | BiLSTM].
- `--Prediction`: select Prediction module [CTC | Attn].
- `--saved_model`: assign the saved model to evaluate.
- `--benchmark_all_eval`: evaluate with the 10 evaluation dataset versions, same as Table 1 in our paper.
Download the failure cases and cleansed labels from here.
image_release.zip contains the failure case images and the benchmark evaluation images with cleansed labels.
- Create your own lmdb dataset.
pip3 install fire
python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath result/
At this time, `gt.txt` should contain lines of the form `{imagepath}\t{label}\n`.
For example:

```
test/word_1.png Tiredness
test/word_2.png kills
test/word_3.png A
...
```

(A minimal sketch of writing this file programmatically is shown after this list.)
- Modify `--select_data`, `--batch_ratio`, and `opt.character`; see this issue.
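If you prefer to assemble `gt.txt` programmatically, here is a minimal sketch using the example paths and labels from above:

```python
# Minimal sketch: write the {imagepath}\t{label}\n lines expected by create_lmdb_dataset.py.
samples = [
    ('test/word_1.png', 'Tiredness'),
    ('test/word_2.png', 'kills'),
    ('test/word_3.png', 'A'),
]

with open('data/gt.txt', 'w') as f:
    for image_path, label in samples:
        f.write(f'{image_path}\t{label}\n')
```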
This implementation has been based on these repositories: crnn.pytorch and ocr_attention.
[1] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NIPS, 2014.
[2] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
[3] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013.
[4] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.
[5] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
[6] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011.
[7] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, pages 682–687, 2003.
[8] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, pages 569–576, 2013.
[9] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In ESWA, volume 41, pages 8027–8048, 2014.
[10] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, volume 39, pages 2298–2304, 2017.
- WebDemo: https://demo.ocr.clova.ai/
  Combination of Clova AI detection and recognition, additional/advanced features used for KOR/JPN.
- Repo of detection: https://github.com/clovaai/CRAFT-pytorch
Please consider citing this work in your publications if it helps your research.
@inproceedings{baek2019STRcomparisons,
title={What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis},
author={Baek, Jeonghun and Kim, Geewook and Lee, Junyeop and Park, Sungrae and Han, Dongyoon and Yun, Sangdoo and Oh, Seong Joon and Lee, Hwalsuk},
booktitle = {International Conference on Computer Vision (ICCV)},
year={2019},
note={to appear},
pubstate={published},
tppubtype={inproceedings}
}
Feel free to contact us if you have any questions:
for code/paper Jeonghun Baek [email protected]; for collaboration [email protected] (our team leader).
Copyright (c) 2019-present NAVER Corp.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.