# Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
This repo contains training files for the ABINet model with a 94-character charset. The pretrained and fine-tuned model weights are provided below.

The model configuration is as follows:
- Trained on ST and MJ with 94 characters, covering lowercase, uppercase and special characters (see the charset sketch after this list).
- The maximum label length is 35.
- The language model is trained on special characters as well.
- The dataset for the language model can be found here: WikiText.
- This dataset is a modified version of WikiText with 2M added words containing special characters. Its words are 50% lowercase, 35% capitalized and 15% uppercase (see the casing sketch after this list).
- Change the paths in `train_abinet_custom.yaml` for retraining.
- Link: checkpoints for the vision, language and alignment models.
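As a rough illustration of the 94-character setup above, the snippet below builds a printable-ASCII charset (10 digits + 52 letters + 32 punctuation marks is exactly 94 symbols). The `data/charset_94.txt` filename and the `index<TAB>char` layout mirror the repo's `charset_36.txt` but are assumptions; check the charset file actually shipped with these checkpoints.

```python
# Minimal sketch, not the repo's script: write a 94-character charset file.
import os
import string

charset = string.digits + string.ascii_letters + string.punctuation
assert len(charset) == 94  # 10 digits + 52 letters + 32 punctuation marks

os.makedirs('data', exist_ok=True)
# Assumed "index<TAB>char" layout, mirroring data/charset_36.txt.
with open('data/charset_94.txt', 'w') as f:
    for i, ch in enumerate(charset):
        f.write(f'{i}\t{ch}\n')
```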
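The casing mix quoted for the modified WikiText can be reproduced per word; this is a hedged sketch of the stated 50/35/15 distribution, not the script that actually built the dataset.

```python
# Hedged sketch: recase a word as lowercase / Capitalized / UPPERCASE with
# probabilities 0.50 / 0.35 / 0.15, matching the stated mix.
import random

def recase(word: str) -> str:
    r = random.random()
    if r < 0.50:
        return word.lower()
    if r < 0.85:  # 0.50 + 0.35
        return word.capitalize()
    return word.upper()  # remaining 0.15
```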
ABINet uses a vision model and an explicit language model, trained in an end-to-end way, to recognize text in the wild. The language model (BCN) achieves a bidirectional language representation by simulating a cloze test, and additionally applies an iterative correction strategy.
We provide a pre-built docker image using the Dockerfile from `docker/Dockerfile`.
## Running in Docker

```
$ git clone git@github.com:FangShancheng/ABINet.git
$ docker run --gpus all --rm -ti --ipc=host -v "$(pwd)"/ABINet:/app fangshancheng/fastai:torch1.1 /bin/bash
```
Alternatively (untested), install the dependencies directly:

```
pip install -r requirements.txt
```
## Training datasets

- MJSynth (MJ):
  - Use `tools/create_lmdb_dataset.py` to convert images into an LMDB dataset (see the LMDB sketch after this list)
  - LMDB dataset: BaiduNetdisk(passwd:n23k)
- SynthText (ST):
  - Use `tools/crop_by_word_bb.py` to crop images from the original SynthText dataset, and convert the cropped images into an LMDB dataset by `tools/create_lmdb_dataset.py`
  - LMDB dataset: BaiduNetdisk(passwd:n23k)
- WikiText103, which is only used for pre-training language models:
  - Use `notebooks/prepare_wikitext103.ipynb` to convert the text into CSV format
  - CSV dataset: BaiduNetdisk(passwd:dk01)
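For orientation, the sketch below shows the LMDB layout that scene-text converters such as `tools/create_lmdb_dataset.py` conventionally produce (`image-%09d`/`label-%09d` entries plus a `num-samples` counter). This is an assumption about the on-disk format, not a copy of the repo's script; verify against `tools/create_lmdb_dataset.py` before relying on it.

```python
# Hedged sketch of the conventional scene-text LMDB layout; not the repo's
# own converter. Key names and the num-samples counter are assumptions.
import lmdb  # pip install lmdb


def write_lmdb(output_path, samples):
    """samples: iterable of (image_bytes, label_str) pairs."""
    env = lmdb.open(output_path, map_size=1 << 40)  # generous sparse map size
    with env.begin(write=True) as txn:
        n = 0
        for image_bytes, label in samples:
            n += 1
            txn.put(f'image-{n:09d}'.encode(), image_bytes)   # raw encoded image
            txn.put(f'label-{n:09d}'.encode(), label.encode())  # ground truth
        txn.put(b'num-samples', str(n).encode())  # total record count
    env.close()
```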
## Evaluation datasets

LMDB datasets can be downloaded from BaiduNetdisk(passwd:1dbv) or GoogleDrive:
- ICDAR 2013 (IC13)
- ICDAR 2015 (IC15)
- IIIT5K Words (IIIT)
- Street View Text (SVT)
- Street View Text-Perspective (SVTP)
- CUTE80 (CUTE)
The structure of the `data` directory is:

```
data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
```
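As an optional sanity check (not part of the repo), you can verify that the layout above is in place before launching training:

```python
# Check that the expected data/ layout exists; swap in your 94-character
# charset filename if you follow the setup described at the top.
import os

expected = [
    'data/charset_36.txt',
    'data/evaluation/CUTE80',
    'data/evaluation/IC13_857',
    'data/evaluation/IC15_1811',
    'data/evaluation/IIIT5k_3000',
    'data/evaluation/SVT',
    'data/evaluation/SVTP',
    'data/training/MJ/MJ_train',
    'data/training/MJ/MJ_valid',
    'data/training/MJ/MJ_test',
    'data/training/ST',
    'data/WikiText-103.csv',
    'data/WikiText-103_eval_d1.csv',
]
missing = [p for p in expected if not os.path.exists(p)]
print('missing:', missing or 'none')
```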
Get the pretrained models from BaiduNetdisk(passwd:kwck) or GoogleDrive. The performance of the pretrained models is summarized as follows:
| Model | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG |
|---|---|---|---|---|---|---|---|
| ABINet-SV | 97.1 | 92.7 | 95.2 | 84.0 | 86.7 | 88.5 | 91.4 |
| ABINet-LV | 97.0 | 93.4 | 96.4 | 85.9 | 89.5 | 89.2 | 92.7 |
- Pre-train the vision model:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_vision_model.yaml
```

- Pre-train the language model:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml
```

- Train ABINet:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/train_abinet.yaml
```
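If you prefer to drive all three stages from a single script, a trivial wrapper over the exact commands above (sequential execution, GPUs 0-3 assumed available) could look like:

```python
# Trivial wrapper around the three training commands above; aborts on the
# first failing stage thanks to check=True.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES='0,1,2,3')
for config in ('configs/pretrain_vision_model.yaml',
               'configs/pretrain_language_model.yaml',
               'configs/train_abinet.yaml'):
    subprocess.run(['python', 'main.py', f'--config={config}'],
                   env=env, check=True)
```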
Note:
- You can set the `checkpoint` path for the vision and language models separately to start from specific pretrained models, or set it to `None` to train from scratch (a config-patching sketch follows this note).
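A minimal sketch of setting those checkpoint paths programmatically is shown below. The nested key names (`model` → `vision`/`language` → `checkpoint`) and the `workdir/...` checkpoint paths are assumptions based on a typical ABINet config layout; adjust both to match the files in `configs/`.

```python
# Minimal sketch, assuming the config nests checkpoints under
# model.vision.checkpoint and model.language.checkpoint; verify in configs/.
import yaml  # pip install pyyaml

with open('configs/train_abinet.yaml') as f:
    cfg = yaml.safe_load(f)

# Hypothetical checkpoint locations; replace with your own paths.
cfg['model']['vision']['checkpoint'] = 'workdir/pretrain-vision-model/best.pth'
cfg['model']['language']['checkpoint'] = 'workdir/pretrain-language-model/best.pth'
# Set either value to None instead to train that sub-model from scratch.

with open('configs/train_abinet_custom.yaml', 'w') as f:
    yaml.safe_dump(cfg, f)
```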
To evaluate a trained model, run:

```
CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test --image_only
```
Additional flags:
- `--checkpoint /path/to/checkpoint` set the path of the evaluation model
- `--test_root /path/to/dataset` set the path of the evaluation dataset
- `--model_eval [alignment|vision]` which sub-model to evaluate
- `--image_only` disable dumping visualizations of attention masks
Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo, or run it locally with:

```
python demo.py --config=configs/train_abinet.yaml --input=figs/test
```
Additional flags:
- `--config /path/to/config` set the path of the configuration file
- `--input /path/to/image-directory` set the path of an image directory or a wildcard path, e.g. `--input='figs/test/*.png'`
- `--checkpoint /path/to/checkpoint` set the path of the trained model
- `--cuda [-1|0|1|2|3...]` set the CUDA device id; -1 (the default) stands for CPU
- `--model_eval [alignment|vision]` which sub-model to use
- `--image_only` disable dumping visualizations of attention masks
Success and failure cases on low-quality images:
If you find our method useful for your research, please cite:

```
@inproceedings{fang2021read,
  title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}
```
This project is free only for academic research purposes, and is licensed under the 2-clause BSD License; see the LICENSE file for details.
Feel free to contact [email protected] if you have any questions.