Tesseract fails to recognize % sign in Hungarian language texts #40
Comments
See hun.unicharset, which lists all known characters. The percent sign was not part of the training data set, so Tesseract simply does not know that character with This can only be solved by new training, either from scratch or by fine-tuning the existing incomplete model (which can add new characters). An alternative would be using the |
A model "which supports all western Europe languages" is not an option for Hungarian, because of ő and ű, which are not found in any Western European language. |
Have you tried with |
Latin.unicharset includes both characters, so I suggest trying it. I updated my previous comment. |
@stweil The more relevant unicharset will be the lstm-unicharset extracted from the script/Latin traineddata file. Latin.unicharset may be a superset. |
You are right, it is not identical, but that one also includes both characters. |
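For anyone who wants to check a unicharset file themselves, here is a minimal sketch. It assumes the plain-text layout in which the first line is an entry count and every following line starts with the symbol, followed by space-separated properties. The function name `unicharset_symbols` and the tiny sample are hypothetical and heavily simplified; they are not the real contents of hun.unicharset or Latin.unicharset.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in unicharset-formatted text."""
    symbols = set()
    # Assumption: line 1 holds the entry count, so it is skipped.
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        # Assumption: the symbol is the first space-delimited field.
        symbols.add(line.split(" ", 1)[0])
    return symbols


# Hypothetical, simplified sample; real files carry more fields per line.
sample = """4
NULL 0 Common 0
ő 3 Latin 1
ű 3 Latin 2
9 8 Common 3
"""

symbols = unicharset_symbols(sample)
for ch in ("%", "ő", "ű"):
    print(ch, "present" if ch in symbols else "missing")
```

To run it against a real file, read the file as UTF-8 and pass its contents to the function.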
How do I test Latin.unicharset? Do I need to train a new model? |
Just get |
When the language is set to "hun" (Hungarian), Tesseract is unable to recognize the % sign. This sign is very commonly used in Hungarian to represent percentages, the same way as in English. Tesseract instead sees various letters and digits - most commonly "96", sometimes "9", "69", "0", "S", "Z", or even nothing at all.
Even if I feed a generated image containing a % sign in large black type on a pure white background, I still can't get Tesseract to output the % sign, as long as the language is set to Hungarian.
Both the "fast" and "best" models suffer from this problem.
If I instead set the language to English, % signs are recognized without issue.
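The comparison above can be reproduced with a small script along these lines. This is only a sketch: the image name `percent.png` and the helper names are hypothetical, and it assumes the `tesseract` command-line tool is on the PATH with the `hun` and `eng` models installed.

```python
import shutil
import subprocess


def tesseract_cmd(image_path, lang):
    """Build the command that prints recognized text to stdout for one language."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang):
    """Run Tesseract and return its text output, or None if the binary is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Show the two commands being compared; "percent.png" is a hypothetical test image.
for lang in ("hun", "eng"):
    print(lang, " ".join(tesseract_cmd("percent.png", lang)))
```

Calling `run_ocr("percent.png", "hun")` and `run_ocr("percent.png", "eng")` on the same image then makes the difference in output easy to compare.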