Getting error when trying to teach network on the base of the default model #227

vlad-wonderkidstudio · 2017-06-13T22:33:02Z

Expected Behavior

If I try to continue training the network from the existing model, it should work fine.

Current Behavior

If I try to continue training the network from the existing model, I always get the following error
(sometimes right after launching the app, sometimes after 10-30 seconds):

Traceback (most recent call last):
  File "/usr/local/bin/ocropus-rtrain", line 289, in <module>
    pcs = network.trainSequence(line,cs,update=do_update,key=fname)
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 902, in trainSequence
    self.targets = array(make_target(cs,self.No))
  File "/usr/local/lib/python2.7/dist-packages/ocrolib/lstm.py", line 734, in make_target
    result[2*i+1,j] = 1.0
IndexError: index 156 is out of bounds for axis 1 with size 156

Possible Solution

Steps to Reproduce (for bugs)

1. Run the command ocropus-rtrain --load models/en-default.pyrnn.gz -o my_new_model_name ground/????/*.bin.png
2. Get an error (sometimes right after executing the command, sometimes after 10-30 seconds)

Your Environment

Python version: Python 2.7.6

Git revision of ocropy:
commit 358df8d
Merge: dacf0fc e016e74
Author: Philipp Zumstein [email protected]
Date: Mon May 22 22:38:33 2017 +0200

Merge pull request #219 from tmbdev/del-bbox-func

Delete unused function bounding_box in ocropus-linegen

Operating System and version:
Ubuntu 14.04.1 LTS


harinath141 · 2017-06-18T13:45:47Z

Try this command:
ocropus-rtrain --load models/en-default.pyrnn.gz -o my_new_model_name ground/????/*.bin.png -S 100 -F200

Beckenb · 2017-06-19T08:56:48Z

This error occurs when you load an existing model and try to teach it characters it has never seen before. In other words, the codec (or chars.py) you are using has characters in it that are not included in the initial training of the en-default model. As far as I know there is no way to expand the codec/character set once training has started.
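The failure described above can be sketched in plain Python. This is not ocrolib's code: `make_target_sketch` is a hypothetical stand-in modeled on the `result[2*i+1,j] = 1.0` line in the traceback, it uses nested lists instead of a NumPy array, and the class indices are invented for illustration. Only the output size of 156 comes from the error message.

```python
def make_target_sketch(classes, n_outputs):
    """Build a one-hot target with a blank row (class 0) around every label.

    Hypothetical stand-in for the function named in the traceback; it uses
    nested lists instead of a NumPy array.
    """
    rows = [[0.0] * n_outputs for _ in range(2 * len(classes) + 1)]
    for i, j in enumerate(classes):
        rows[2 * i][0] = 1.0        # blank before each label
        rows[2 * i + 1][j] = 1.0    # IndexError if j >= n_outputs
    rows[-1][0] = 1.0               # trailing blank
    return rows


N_OUTPUTS = 156  # size reported in the error message

# Indices inside the model's output range work.
ok = make_target_sketch([5, 17, 155], N_OUTPUTS)

# An index equal to the output size does not fit.
try:
    make_target_sketch([5, 156], N_OUTPUTS)
    overflowed = False
except IndexError:
    overflowed = True
```

If this reading is right, the error depends on which characters appear in a given line, which would fit the report that it sometimes shows up immediately and sometimes only after a while.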

jze · 2017-08-14T09:10:52Z

I have been able to reproduce the problem with a character included in the existing model's codec. Try to continue training the en-default model with this image:

./ocropus-rtrain --load models/en-default.pyrnn.gz -o test 29265260-aea06d62-80e0-11e7-99f3-d0e061cec2a0.png

The resulting error is IndexError: index 156 is out of bounds for axis 1 with size 156
Or doesn't the en-default model use the default codec?

Beckenb · 2017-10-23T11:25:36Z

I can't say what codec was used to train the en-default model (I don't know if there is any way to find out at all), but I highly doubt that they used the default codec, since it includes German and French characters.

mcriggs · 2018-05-10T14:15:05Z

I had a similar issue. To solve it for myself, I first interpreted the "size 156" in the error message as a reference to the size of the codec with which the en-default model was trained. This might not be correct, I understand. I also noticed that the default codec in my chars.py file was larger than 156 characters. I experimented with various character sets in the chars.py file and found that using the following codec of 156 characters while training on the en-default model resolved the issue:

~!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}¡¢£§©«®°¶»¿ÀÂÄÆÇÈÉÊËÎÏÔÖÙÛÜßàâäæçèéêëîïôö÷ùûüÿŒœŸ†‡•‣‹›€∙▪▫

(N.B. other similar sets of 156 characters may seem to work, but deletions or substitutions of these characters will, I found, lead to remappings of characters in the ocropus-rpred output. For example, an "è" might be consistently output as an "û" or something like that.)
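The remapping noted above is what one would expect if the network only emits class indices and the codec turns them back into characters. The sketch below assumes a codec built by sorting the character set with index 0 reserved for the blank; that construction is an assumption, not something confirmed in this thread. `build_codec` and `decode` are hypothetical helpers, and the character sets are toy examples.

```python
def build_codec(charset):
    """Return the index -> character table for a character set.

    Assumption: index 0 is reserved for the blank and the remaining
    characters are ordered by code point.
    """
    return [""] + sorted(set(charset))


def decode(indices, codec):
    """Turn class indices back into text using the given table."""
    return "".join(codec[i] for i in indices)


trained_with = build_codec(u"abèéû")   # set used when the model was trained
edited = build_codec(u"abéûü")         # "è" dropped, "ü" added later

# Class indices the network would emit for "bè" under the original table.
indices = [trained_with.index(c) for c in u"bè"]

original_text = decode(indices, trained_with)
shifted_text = decode(indices, edited)
```

Both tables have the same size, so nothing crashes, but the same indices decode to different text under the edited table.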

The chars.py file I now use for training on the en-default model is as follows:

digits = u"0123456789"
letters = u"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
symbols = ur"""!"#$%&'()*+,-./:;<=>?@[]^_`{|}~"""
ascii = digits+letters+symbols
xsymbols = u"""€¢£»«›‹÷©®†‡°∙•‣¶§÷¡¿▪▫"""
mychars = u"ÀÂÄÆÇÈÉÊËÎÏÔÖÙÛÜßàâäæçèéêëîïôöùûüÿŸŒœ"
default = ascii+xsymbols+mychars
european = default

Changing the chars.py file in this way resolved this issue, at least for me. I have had no further problems training new models on the en-default model since.

JZE, all the characters in your image of the text "Mückendorf 167. 4." are admissible. It is likely, however, that the additional characters in the default codec of the chars.py file you were using pushed the ü beyond the limit of the 156-character codec and thus caused the error.
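One way to test this explanation before training is to encode the ground-truth text and look for indices at or beyond the model's output size. A rough sketch, again assuming a sorted codec with index 0 as the blank: `out_of_range_chars` is a hypothetical helper, and the small character sets and the limit of 8 are toy values standing in for the real codec and the 156 from the error message.

```python
def out_of_range_chars(text, charset, n_outputs):
    """List characters of `text` whose class index would not fit the model.

    Assumption: index 0 is the blank and the other characters are sorted by
    code point. Characters missing from `charset` are reported as well.
    """
    codec = [""] + sorted(set(charset))
    index_of = {c: i for i, c in enumerate(codec)}
    bad = []
    for c in text:
        i = index_of.get(c)
        if i is None or i >= n_outputs:
            bad.append(c)
    return bad


N_OUTPUTS = 8  # toy limit; the thread's error message reports 156

small_set = u"Mcdekrü"          # 7 characters + blank = 8 classes
larger_set = small_set + u"àé"  # extra characters that sort before "ü"

fits = out_of_range_chars(u"Mücke", small_set, N_OUTPUTS)
overflows = out_of_range_chars(u"Mücke", larger_set, N_OUTPUTS)
```

With the smaller set every character fits, while the two extra characters in the larger set move "ü" past the limit even though "ü" itself was always in the set.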

zuphilip mentioned this issue Jul 20, 2017

Question: What is the procedure for using Ocropy with non alpha numeric characters #224

Closed
