old russian / church slavonic glyphs? #24
Comments
Which language traineddata are you using currently?
I'm using 'rus' from tessdata_best. Tried adding 'bul' and 'srp', to no avail.
@Shreeshrii While I'm trying to make sense of that plus-training procedure (your point 2):
Ray has trained models for languages, e.g. eng and rus, and also for the scripts in which various languages are written, e.g. the Latin script for English, French, German, etc. My suggestion was for you to use script/Cyrillic and compare the results with rus. If the letters you want to add are in one of the other languages, they might be recognised. Re. 4: yes, along with the training text you also need a font which renders those glyphs correctly.
Please review the following files: https://github.com/tesseract-ocr/langdata/tree/master/rus https://github.com/tesseract-ocr/langdata/blob/master/Cyrillic.unicharset Adding these glyphs will require changes in the langdata repo for rus, e.g. adding these glyphs to the desired_characters file.
Does anybody know of any progress on this subject, i.e. Old Russian support for Tesseract?
@maxirmx, maybe you can contribute by reviewing the files named above?
@stweil, thank you. It is fairly clear what to do, but I do not want to repeat work that others might have done already.
Okay, "Ѣ" and maybe the other older glyphs are also missing in https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Cyrillic/Cyrillic.unicharset. So you will need ground truth data to train a new model based on
See also issue tesseract-ocr/langdata_lstm#3, which looks like a duplicate. Maybe you can join efforts.
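To check a unicharset for the older glyphs yourself, something like the following might help. It is only a rough sketch: `unicharset_symbols` and `missing_glyphs` are hypothetical helpers, not part of Tesseract, and the parser assumes the first line holds the entry count and every later line starts with the symbol followed by whitespace-separated properties. The sample text is made up, with the property fields cut down to a placeholder.

```python
# Hypothetical helpers, not part of Tesseract.
# yat, fita, izhitsa in capital and small forms (U+0462/0463, U+0472..U+0475)
TARGET_GLYPHS = "\u0462\u0463\u0472\u0473\u0474\u0475"


def unicharset_symbols(text):
    """Return the set of symbols listed in unicharset-formatted text.

    Assumption: line 1 is the entry count, and each following line
    begins with the symbol, then whitespace, then its properties.
    """
    lines = text.splitlines()[1:]
    return {line.split()[0] for line in lines if line.strip()}


def missing_glyphs(text, wanted=TARGET_GLYPHS):
    """List the wanted glyphs that have no entry in the unicharset text."""
    present = unicharset_symbols(text)
    return [ch for ch in wanted if ch not in present]


# Made-up sample; real files carry more property fields per line.
sample = "4\nNULL 0\n\u0430 3\n\u0411 5\n\u0463 3\n"
for ch in missing_glyphs(sample):
    print(f"missing U+{ord(ch):04X} {ch}")
```

For a real file, pass in `open("Cyrillic.unicharset", encoding="utf-8").read()` instead of the sample string.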
@stweil, are there any requirements for the training words/text (apart from the aforementioned 150 lines)? For example, how many times should each new character occur in the training set? Should each letter appear at least once in both capital and lowercase form, or something like that? Arbitrary text(s) in Old Russian can be obtained, for example, from ru.wikisource.org. For example, https://ru.wikisource.org/wiki/%D0%91%D0%BE%D0%B6%D0%B5%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BA%D0%BE%D0%BC%D0%B5%D0%B4%D0%B8%D1%8F_(%D0%94%D0%B0%D0%BD%D1%82%D0%B5;_%D0%9C%D0%B8%D0%BD)/%D0%94%D0%9E.
I still fail to comprehend the process well enough. Here's a thought/question:
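Whatever the required counts turn out to be, it is easy to measure how often each new glyph occurs in a candidate training text, and whether both cases are present. A minimal sketch, assuming the text is already loaded as a Python string; `glyph_coverage` is a hypothetical helper and the sample string is invented. It only reports counts and does not say how many occurrences are enough.

```python
from collections import Counter

# yat, fita, izhitsa in capital and small forms
TARGET_GLYPHS = "\u0462\u0463\u0472\u0473\u0474\u0475"


def glyph_coverage(text, wanted=TARGET_GLYPHS):
    """Count how often each wanted glyph occurs in text (0 if absent)."""
    counts = Counter(ch for ch in text if ch in wanted)
    return {ch: counts[ch] for ch in wanted}


# Invented sample: two small yats and one capital fita.
sample = "\u0463 \u0463 \u0472"
for ch, n in glyph_coverage(sample).items():
    print(f"U+{ord(ch):04X} {ch}: {n}")
```

Glyphs that come back with a count of 0 are the ones the text would need more material for.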
Is it possible to add support for the Old Russian / Church Slavonic glyphs, at least for 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475)?
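For reference, the six code points can be listed with their official Unicode names using Python's standard `unicodedata` module. This is just a small sketch; `describe` is a hypothetical helper name.

```python
import unicodedata

# Code points named in the question: yat, fita, izhitsa (capital, small).
CODE_POINTS = [0x0462, 0x0463, 0x0472, 0x0473, 0x0474, 0x0475]


def describe(code_points):
    """Return (U+XXXX label, character, Unicode name) for each code point."""
    return [
        (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp)))
        for cp in code_points
    ]


for label, ch, name in describe(CODE_POINTS):
    print(label, ch, name)
```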