3. Preparing Train, Test and Validation Split

Split the dataset into Train, Test and Validation sets

There are multiple configurations that can be used to train the OCR model:

C1 - Mixed : Trains on a combination of synthetic and real data.
C2 - Only Synthetic : Trains on synthetic data only.
C3 - Mixed, then fine-tuned : Trains on a combination of synthetic and real data (as in C1), then fine-tunes on real data.

NOTE: The real data has not been made public. You need real data only for the C1 and C3 configurations (you can create your own); the C2 configuration does not need it. To deploy C1 and C3, you can use our pre-trained models as described here.

If you are using your own real data, make sure that you have an annotations file named "annot_real.txt" in the label_data directory. Each line of this file must have the format: <Absolute_Image_Path>+' '+<Corresponding_Label_Text>.
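Before running the split, it can help to check that every line of annot_real.txt follows this format. The following is a minimal sketch, not part of the repository: the function names `parse_annotation_line` and `find_problems` are hypothetical, and it assumes that the image path itself contains no spaces, so the first space on a line separates the path from the label text.

```python
from pathlib import Path


def parse_annotation_line(line):
    """Split one annotation line into (image_path, label_text).

    Assumption: the image path contains no spaces, so the first space
    is the separator and the rest of the line is the label.
    """
    line = line.rstrip("\n")
    image_path, sep, label = line.partition(" ")
    if not sep or not image_path or not label:
        raise ValueError(f"Malformed annotation line: {line!r}")
    return image_path, label


def find_problems(lines):
    """Return (line_number, message) pairs for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            image_path, _ = parse_annotation_line(line)
        except ValueError as exc:
            problems.append((number, str(exc)))
            continue
        if not Path(image_path).is_absolute():
            problems.append((number, f"Path is not absolute: {image_path}"))
    return problems
```

To use it, pass the lines of the annotations file, for example `find_problems(open("label_data/annot_real.txt", encoding="utf-8"))`, and fix any reported lines before splitting.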

To split the annot_real.txt file into separate text files for train, validation and test, run the following:

python3 scripts/train_test_split.py <fraction for train> <fraction for validation> <fraction for test>

For example, to split the annot_real.txt file into 70% for train, 20% for validation and 10% for test, run the following command:

python3 scripts/train_test_split.py 0.7 0.2 0.1
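The internals of scripts/train_test_split.py are not reproduced on this page. To illustrate what a fraction-based split does, here is a minimal sketch under stated assumptions: the function `split_by_fractions` and its `seed` parameter are hypothetical, the lines are shuffled before splitting, and the three fractions are required to sum to 1. The actual script may behave differently on any of these points.

```python
import math
import random


def split_by_fractions(items, train_frac, val_frac, test_frac, seed=0):
    """Shuffle items and split them into (train, validation, test) lists.

    Assumptions (not taken from the actual script): the fractions must
    sum to 1, and a fixed seed makes the shuffle reproducible.
    """
    total = train_frac + val_frac + test_frac
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Fractions must sum to 1, got {total}")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    # The test set takes whatever remains, so no item is dropped.
    test = shuffled[n_train + n_val:]
    return train, validation, test
```

Giving the remainder to the test set means every annotation line ends up in exactly one of the three lists, even when the fractions do not divide the dataset evenly.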

This will create 5 text files in the label_data directory: annot_realTrain.txt, annot_realTest.txt, annot_realValidation.txt, annot_synthetic_only.txt and annot_mixed.txt. These files will have labels in the format: <Complete_Image_path>+' '+<Corresponding_Label_Text>.
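A quick way to confirm that the split finished is to check that all five files exist. This is a small sketch, not part of the repository: the helper name `missing_split_files` is hypothetical, and the default path assumes the check is run from the repository root.

```python
from pathlib import Path

# File names as listed above.
EXPECTED_FILES = [
    "annot_realTrain.txt",
    "annot_realTest.txt",
    "annot_realValidation.txt",
    "annot_synthetic_only.txt",
    "annot_mixed.txt",
]


def missing_split_files(label_dir="label_data"):
    """Return the expected annotation files that are absent from label_dir."""
    label_dir = Path(label_dir)
    return [name for name in EXPECTED_FILES if not (label_dir / name).is_file()]
```

An empty list from `missing_split_files()` means all five files are present.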
