This is a TensorFlow implementation of a Deep Neural Network for scene text recognition. It is mainly based on the paper "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition". You can refer to the paper for architecture details. Thanks to the author Baoguang Shi.
The model consists of a CNN stage that extracts features, which are fed to an RNN stage (a bidirectional LSTM), followed by a CTC loss.
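As a minimal sketch of how the CNN features become the RNN's input sequence: the height dimension of the final feature map is collapsed, and each remaining column becomes one time step. The exact shapes used here (25 time steps, 512 channels) are assumptions taken from the paper's standard configuration, not read from this repo's code.

```python
import numpy as np

# Hypothetical CNN output for a batch of 32x100x3 input images.
# The CRNN convolutional stage collapses the height to 1 and
# downsamples the width to ~25 time steps with 512 channels
# (shapes assumed from the paper's typical configuration).
batch, height, width, channels = 4, 1, 25, 512
cnn_features = np.random.rand(batch, height, width, channels).astype(np.float32)

# Squeeze the height dimension so each column of the feature map
# becomes one time step of the sequence fed to the Bi-LSTM.
sequence = cnn_features.squeeze(axis=1)
print(sequence.shape)  # (4, 25, 512)
```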
This software has been developed on Ubuntu 16.04 (x64) using Python 3.5 and TensorFlow 1.10. Since it uses some recent TensorFlow features, it is incompatible with older versions.
The following methods are provided to install dependencies:
There are Dockerfiles inside the `docker` folder. Follow the instructions inside `docker/README.md` to build the images.
You can create a conda environment with the required dependencies using:

```
conda env create -f crnntf-env.yml
```
Required packages may be installed with:

```
pip3 install -r requirements.txt
```
In this repo you will find a model pre-trained on a subset of the Synth 90k dataset. You can find the data as TensorFlow records in the `data` folder. The trained model can be tested with:

```
python tools/test_shadownet.py --dataset_dir data/ --weights_path model/shadownet/shadownet_2017-09-29-19-16-33.ckpt-39999
```
If you want to test a single image you can do it with:

```
python tools/demo_shadownet.py --image_path data/test_images/test_01.jpg --weights_path model/shadownet/shadownet_2017-09-29-19-16-33.ckpt-39999
```
First, store all your image data in a root folder, then supply a text file named `sample.txt` that specifies the image paths (relative to the image data directory) and their corresponding text labels. For example:

```
path/1/2/373_coley_14845.jpg coley
path/17/5/176_Nevadans_51437.jpg nevadans
```
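The annotation format above can be parsed with a couple of lines. This helper is purely illustrative and not part of the repo:

```python
def parse_annotation_line(line):
    """Split one sample.txt line into (relative_path, label).

    Assumes the format shown above: path and label separated by
    whitespace, with the label containing no spaces.
    """
    path, label = line.strip().rsplit(maxsplit=1)
    return path, label

pairs = [parse_annotation_line(l) for l in [
    "path/1/2/373_coley_14845.jpg coley",
    "path/17/5/176_Nevadans_51437.jpg nevadans",
]]
print(pairs)  # [('path/1/2/373_coley_14845.jpg', 'coley'), ('path/17/5/176_Nevadans_51437.jpg', 'nevadans')]
```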
Second, convert your dataset into TensorFlow records and extract the character set with:

```
python tools/write_text_features.py --dataset_dir path/to/your/dataset --save_dir path/to/tfrecords_dir --charset_dir path/to/charset_dir
```
All the training images will be scaled to a fixed size (by default (32, 100, 3)) and the dataset will be divided into train, test and validation sets. Check `global_config/config.py` and run `python tools/write_text_features.py` for options.
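As a rough illustration of the character-set extraction step (the exact file format written to `charset_dir` is not specified here, so this is a simplified stand-in for what `tools/write_text_features.py` does):

```python
def extract_charset(labels):
    """Collect the sorted set of characters appearing in the labels.

    Illustrative only: the real script also writes the charset to
    disk in a format this sketch does not reproduce.
    """
    return sorted({ch for label in labels for ch in label})

print(extract_charset(["coley", "nevadans"]))
# ['a', 'c', 'd', 'e', 'l', 'n', 'o', 's', 'v', 'y']
```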
For all the available training parameters, check `global_configuration/config.py`, then train your model with:

```
python tools/train_shadownet.py --dataset_dir path/to/your/tfrecords
```
If you wish, you can add more metrics to the training progress messages with `--decode_outputs`, but this will slow training down. You can also continue the training process from a snapshot with:

```
python tools/train_shadownet.py --dataset_dir path/to/your/tfrecords --weights_path path/to/your/last/checkpoint
```
After several iterations you can check the TensorBoard logs in `logs/`. You should see something like:
The sequence distance is computed as the edit distance between two sparse tensors, so the lower the value, the better the model performs. The training accuracy is computed as the character-wise precision between the prediction and the ground truth, so the higher the value, the better the model performs.
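The sequence distance metric can be illustrated with a plain Levenshtein distance. The repo computes it on sparse tensors (TensorFlow provides `tf.edit_distance` for this); the pure-Python version below is just a sketch of the underlying metric:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming.

    Illustrates the sequence distance metric; the actual training code
    operates on sparse tensors rather than Python strings.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("nevadans", "nevedans"))  # 1 (one substitution)
```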
Finally, note that it is possible to use multiple config files for different experiments via the `--config_file` option, which is accepted by all scripts.
The original experiment ran for 40000 epochs, with a batch size of 32, an initial learning rate of 0.1 and exponential decay of 0.1 every 10000 epochs. During training the loss dropped as follows:

The distance between the ground truth and the prediction dropped as follows:
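The decay schedule described above (initial rate 0.1, multiplied by 0.1 every 10000 epochs) can be sketched as a staircase exponential decay. Whether `train_shadownet.py` implements it exactly this way (e.g. via `tf.train.exponential_decay` with `staircase=True`) is an assumption; the formula simply matches the described schedule:

```python
def learning_rate(step, initial_lr=0.1, decay_rate=0.1, decay_steps=10000):
    # Staircase decay: multiply the rate by decay_rate every decay_steps.
    return initial_lr * decay_rate ** (step // decay_steps)

for step in (0, 10000, 20000, 39999):
    print(step, learning_rate(step))
```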
The model is trained on a subset of Synth 90k. It would make sense to train on the whole dataset to obtain a more robust model, since the CRNN model needs a large amount of training data to achieve good performance.