
HTR

Training HTR

Convert data to a format usable by the Loghi Framework

docker run --rm -v $SRC/:$SRC/ -v $tmpdir:$tmpdir docker.loghi-tooling /src/loghi-tooling/minions/target/appassembler/bin/MinionCutFromImageBasedOnPageXMLNew -input_path $SRC -outputbase $tmpdir/imagesnippets/ -output_type png -channels 4 -threads 4

Replace $SRC with the directory containing the images and the PageXML. That is: the images live in one directory, and that directory contains a folder named "page" with the PageXML files. Each PageXML file must follow the naming convention IMAGEFILENAME.xml, where IMAGEFILENAME.jpg is the image file (this is the Transkribus default).
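
For example, a layout like this (with hypothetical filenames) is what the tool expects:

$SRC/scan_0001.jpg
$SRC/scan_0002.jpg
$SRC/page/scan_0001.xml
$SRC/page/scan_0002.xml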

Replace $tmpdir with the path to the directory where you want the output to be written.

Now we should have the text lines as separate image files. A convenience script that makes life easier is provided:

create_train_data.sh

Just run it and follow the instructions.

The generated training data can be used with the Loghi Framework (and, with some alterations, with PyLaia; this is still work to be done).

Create and train a neural network

In general you can use the default settings and only provide the training list and validation list.

When you have little data, or when you don't care about training time and want the best results, use:

--random_width: augments the data by stretching and squeezing the input textline horizontally

--elastic_transform: augments the data by a random elastic transform

You will need a few more epochs, but it will be worth it, especially with little data.
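
For example, enabling both augmentations and compensating with some extra epochs (30 is just an illustrative number):

--random_width --elastic_transform --epochs 30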

Little data: use a lower batch_size:

--batch_size 2

Tons of data: use a higher batch_size:

--batch_size 24

If you run out of memory during training: lower the batch size or decrease the size of your network by using fewer layers and units.

Alternatively, stick with a batch size of around 4 combined with a lower learning rate and more epochs to get a better result.
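
A sketch of that combination, assuming the learning-rate option is named --learning_rate (an assumption, check python3.8 /src/src/main.py --help for the exact name); the values are only illustrative:

--batch_size 4 --learning_rate 0.0001 --epochs 30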

In general: more epochs means better results. Only the validation scores are really interesting; the loss and CER on the training set are not as relevant.

To improve results during inference or validation increase the beam_width:

--beam_width 10 (or higher)

This slows down the decoding process, but will improve the results in general.

The recurrent part of the network is where the magic happens: small pieces are combined into a transcription. There are several parameters for this part that you can change.

The number of recurrent layers. I haven't tried more than 5 or fewer than 3; 3 seems fine for most cases.

--rnn_layers 3

The number of units per recurrent layer. More units can mean better results, but with too many the network overfits easily. Values tried with working results: 128, 256, 512, 1024

--rnn_units 256

To avoid overfitting you can use dropout in the RNN layers. The network will then learn the features more robustly, at the expense of longer training time. This is a must when training on smaller datasets or if you want the best results.

--use_rnn_dropout

In general you want this turned on.

There are two RNN types to choose from: LSTM or GRU. The default is LSTM; if you want to use GRU, add:

--use_gru
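
Putting the recurrent options together, a typical combination using the values discussed above looks like this:

--rnn_layers 3 --rnn_units 256 --use_rnn_dropout --use_gru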

Advanced: To reuse an existing model without changing the recurrent layers you can use:

--freeze_recurrent_layers

Advanced: make sure to unfreeze them later during fine-tuning using:

--thaw

Advanced: to freeze the dense layers of an existing model use:

--freeze_dense_layers

Advanced: to freeze the convolutional layers use:

--freeze_conv_layers

Advanced: multiply the training data. This simply makes one epoch run over the same training data multiple times. You can use this with tiny datasets when you don't want the overhead of a validation run after every short epoch.

--multiply 1

To reuse an existing model you can use:

--existing_model MODEL_NAME_HERE

Make sure to add

--charlist MODEL_NAME_HERE.charlist

In the current version you need to make sure yourself that the charlist is stored under the correct filename. THIS IS NOT DONE AUTOMATICALLY YET.
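
As a hypothetical example, if training wrote the charlist to output/charlist.charlist (as in the commands below) and you want to reuse the model under the name MODEL_NAME_HERE, copy the charlist yourself:

cp output/charlist.charlist MODEL_NAME_HERE.charlist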

Training

--train_list needs a reference to a file that contains the training data. You can use multiple training files. The argument should look something like:
"/path_to_file/file.txt /path_to_other_file/file.txt"

To validate a model use:
--do_validate
and provide a validation list:
--validation_list LIST_FILE

Inferencing works similarly, but requires a results file:

--results_file RESULT_FILE

Where the results are to be stored. These results can later be used to attach text to individual lines in the PageXML.
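
A minimal sketch of an inference run, assuming the inference counterparts of the training switches are named --do_inference and --inference_list (these names are an assumption, check python3.8 /src/src/main.py --help); the list file, model name and results file are placeholders:

docker run -v /scratch:/scratch --gpus all --rm -ti docker.htr python3.8 \
/src/src/main.py --do_inference --inference_list lines_to_infer.txt \
--existing_model EXISTING_MODEL --charlist EXISTING_MODEL.charlist \
--results_file results.txt --beam_width 10 --height 64 --channels 4 \
--use_mask --gpu 0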

A typical training command for training from scratch looks like this:

docker run -v /scratch:/scratch -v /scratch/tmp/output:/src/src/output/ \
--gpus all --rm -m 32000m --shm-size 10240m -ti docker.htr python3.8 \
/src/src/main.py --do_train \
--train_list training_all_ijsberg_tiny_train.txt \
--validation_list training_all_ijsberg_tiny_val.txt \
--channels 4 --batch_size 4 --epochs 10 --do_validate --gpu 0 \
--height 64 --memory_limit 6000 --use_mask --seed 1 --beam_width 10 \
--model new9 --rnn_layers 3 --rnn_units 256 --use_gru --decay_steps 5000 \
--batch_normalization --output_charlist output/charlist.charlist \
--output output --charlist output/charlist.charlist \
--use_rnn_dropout --random_width --elastic_transform

Notice in the above docker command that the output will be stored on the local disk at:
/scratch/tmp/output

Two "tiny" lists are provided. These contain 1000 random training and 1000 random validation lines from the ijsberg dataset.

  • training_all_ijsberg_tiny_train
  • training_all_ijsberg_tiny_val

These are really useful for trying things out.

You can use several preset configs for the neural networks:

--model new9
Is very similar to Transkribus' PyLaia models.

--model new10
Has larger conv layers, which can be beneficial especially for larger models. It will slow down training, but increases accuracy if you have a large dataset. Do not use it for smaller datasets.

--model new11
A larger model with optional dropout in the final dense layer. This should improve results, but is largely untested. To activate the dropout, additionally add:

--use_dropout

A typical training command for fine-tuning from an existing base model looks like this:

docker run -v /scratch:/scratch -v /scratch/tmp/output:/src/src/output/ \
--gpus all --rm -m 32000m --shm-size 10240m -ti docker.htr python3.8 \
/src/src/main.py --do_train \
--train_list training_all_ijsberg_tiny_train.txt \
--validation_list training_all_ijsberg_tiny_val.txt \
--channels 4 --batch_size 4 --epochs 10 --do_validate --gpu 0 \
--height 64 --memory_limit 6000 --use_mask --seed 1 --beam_width 10 \
--model new9 --rnn_layers 3 --rnn_units 256 --use_gru --decay_steps 5000 \
--batch_normalization --output_charlist output/charlist.charlist \
--output output --charlist EXISTING_MODEL.charlist \
--use_rnn_dropout --random_width --elastic_transform \
--existing_model EXISTING_MODEL

You can use the above command if all characters in the new data were also present in the previous dataset.

If not, you should add:
--replace_final_layer

Advanced: freeze existing layers and thaw later:

Add these and run for 1 epoch
--freeze_conv_layers --freeze_recurrent_layers
--replace_final_layer --epochs 1

Next, remove the freeze and replace parameters and add
--thaw

And continue with more epochs.
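
As a sketch, combine these flag groups with the full base-model command above (EXISTING_MODEL and OUTPUT_OF_STAGE_1 are placeholders; OUTPUT_OF_STAGE_1 stands for wherever stage 1 wrote its model):

Stage 1, one epoch with frozen layers and a new final layer:
--existing_model EXISTING_MODEL --freeze_conv_layers --freeze_recurrent_layers --replace_final_layer --epochs 1

Stage 2, continue training with everything unfrozen:
--existing_model OUTPUT_OF_STAGE_1 --thaw --epochs 10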

Inferencing data

Inferencing data means using the trained models to create a transcription.

For this a convenience script "na-pipeline.sh" is provided.
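
A hypothetical invocation, assuming the script takes the directory with the scans as its first argument (check the script itself for the exact arguments and the configured model paths):

./na-pipeline.sh /path/to/scans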

Postprocessing

Region detection and cleaning (rule based)

See "na-pipeline.sh" for an example.