Getting error when trying to teach network on the base of the default model #227
Comments
Try this command:
This error occurs when you load an existing model and try to teach it characters it has never seen before. In other words, the codec (or chars.py) you are using contains characters that were not included in the initial training of the en-default model. As far as I know, there is no way to expand the codec/character set once training has started.
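A quick way to see which characters would be new to a model is to compare the ground-truth text against the character set the model is believed to know. The following is only a minimal sketch: `unseen_characters` and `known_chars` are hypothetical names, and the character set shown is a stand-in for illustration, not the actual en-default codec.

```python
def unseen_characters(lines, known_chars):
    """Return the sorted characters in `lines` that are absent from `known_chars`."""
    known = set(known_chars)
    found = set()
    for line in lines:
        found.update(ch for ch in line if ch not in known)
    return sorted(found)


# Hypothetical stand-in for a model's character set (NOT the real en-default codec).
known_chars = " .0123456789Mcdefknor"

# Ground-truth line taken from the example discussed in this thread.
print(unseen_characters(["Mückendorf 167. 4."], known_chars))
```

Any character this reports would have to be either removed from the training data or already present in the codec the model was trained with.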
I can't say which codec was used to train the en-default model (I don't know whether there is any way to find out), but I highly doubt that the default codec was used, since it includes German and French characters.
I had a similar issue. To solve it, I first interpreted the "size 156" in the error message as the size of the codec with which the en-default model was trained. I understand that this might not be correct. I also noticed that the default codec in my chars.py file was larger than 156 characters. I experimented with various character sets in chars.py and found that using the following codec of 156 characters while training on the en-default model resolved the issue:

~!"#$%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz[]^_`abcdefghijklmnopqrstuvwxyz{|}¡¢£§©«®°¶»¿ÀÂÄÆÇÈÉÊËÎÏÔÖÙÛÜßàâäæçèéêëîïôö÷ùûüÿŒœŸ†‡•‣‹›€∙▪▫

(N.B. Other similar sets of 156 characters may seem to work, but I found that deletions or substitutions of these characters lead to remappings of characters in the ocropus-rpred output. For example, an "è" might be consistently output as an "û".)

The chars.py file I now use for training on the en-default model is as follows: digits = u"0123456789"

Changing the chars.py file in this way resolved the issue, at least for me. I have had no further problems training new models on the en-default model since.

JZE, all the characters in your image of the text "Mückendorf 167. 4." are admissible. It is likely, however, that because of the additional characters in the default codec of the chars.py file you were using, the ü was pushed beyond the limit of the 156-character codec and thus caused the error.
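One plausible explanation for the remapping described above is that a codec assigns each character an output index by position, so deleting a character shifts every later character onto an index the network learned for something else. The sketch below illustrates that idea only; `build_codec` and `remapped_outputs` are hypothetical helpers, the short character strings are made up for the example, and this is not claimed to be how ocropy builds its codec internally.

```python
def build_codec(chars):
    """Map each character to its position, keeping first occurrences only."""
    codec = {}
    for ch in chars:
        if ch not in codec:
            codec[ch] = len(codec)
    return codec


def remapped_outputs(trained_chars, new_chars):
    """List (index, trained char, new char) wherever the two codecs disagree."""
    trained = list(build_codec(trained_chars))
    new = list(build_codec(new_chars))
    return [(i, t, n) for i, (t, n) in enumerate(zip(trained, new)) if t != n]


# Toy codecs: the second one drops "c", so later characters shift down by one.
for index, trained_char, new_char in remapped_outputs("abcèû", "abèû"):
    print(index, trained_char, "->", new_char)
```

In this toy case, the index the network learned for "è" is decoded as "û" under the shortened codec, which matches the kind of consistent substitution reported above. Keeping the character order and count identical to the set used for the original training avoids the shift.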
Expected Behavior
If I train the network starting from the existing model, it should work fine.
Current Behavior
If I train the network starting from the existing model, I always get the following error
(sometimes just after launching the app, sometimes in 10-30 seconds)
Possible Solution
Steps to Reproduce (for bugs)
Your Environment
Git revision of ocropy:
commit 358df8d
Merge: dacf0fc e016e74
Author: Philipp Zumstein [email protected]
Date: Mon May 22 22:38:33 2017 +0200
Merge pull request #219 from tmbdev/del-bbox-func
Delete unused function bounding_box in ocropus-linegen
Operating System and version:
Ubuntu 14.04.1 LTS