old russian / church slavonic glyphs? #24

yurytch · 2018-03-29T07:47:22Z

Is it possible to add support for the Old Russian / Church Slavonic glyphs, at least for 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475)?

Shreeshrii · 2018-03-29T09:02:29Z

Which language traineddata are you using currently?

yurytch · 2018-03-29T09:22:19Z

I'm using 'rus' from tessdata_best. Tried adding 'bul' and 'srp', to no avail.
It would be great if there were an additional datafile just for recognizing those glyphs, also in cursive (yat!). Does tesseract work like this?

Shreeshrii · 2018-03-29T09:49:44Z

1. Try with 'rus' from tessdata_fast and see if that is better.
2. Try the 'pluschar' training using 'rus' from tessdata_best as the continue_from model. Add at least 15 occurrences of each Old Russian / Church Slavonic glyph that you want to add, so that they get picked up in the unicharset.
3. Also try with script/Cyrillic (or another appropriate script used for Russian).
4. Please share about 150 lines of training text that contains the added glyphs, for testing.
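
A quick way to check the "at least 15 occurrences" suggestion before training is to count the glyphs in the training text. The sketch below is only an illustration: the threshold comes from the comment above, while the function names and the sample line are hypothetical.

```python
from collections import Counter

# Glyphs requested in this thread: yat, fita, izhitsa (capital and small).
OLD_GLYPHS = {
    "\u0462": "CAPITAL YAT",
    "\u0463": "SMALL YAT",
    "\u0472": "CAPITAL FITA",
    "\u0473": "SMALL FITA",
    "\u0474": "CAPITAL IZHITSA",
    "\u0475": "SMALL IZHITSA",
}

MIN_OCCURRENCES = 15  # threshold suggested in the comment above


def glyph_counts(text):
    """Return how often each old glyph occurs in the training text."""
    counts = Counter(text)
    return {ch: counts[ch] for ch in OLD_GLYPHS}


def underrepresented(text, minimum=MIN_OCCURRENCES):
    """Return the glyphs that occur fewer than `minimum` times."""
    return [ch for ch, n in glyph_counts(text).items() if n < minimum]


if __name__ == "__main__":
    # Hypothetical training lines; replace with the real training text.
    sample = "\u0411\u0463\u043b\u044b\u0439 \u0441\u043d\u0463\u0433\u044a\n" * 20
    for ch in underrepresented(sample):
        print(f"U+{ord(ch):04X} {OLD_GLYPHS[ch]}: too few occurrences")
```

Counting capital and small forms separately matters because they are distinct code points and would be distinct entries in the unicharset.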

yurytch · 2018-03-30T05:02:35Z

@Shreeshrii While I'm trying to make sense of that plus-training procedure (your point 2):
your pt. 1 doesn't work (more OCR errors with 'rus' from *_fast),
I don't understand your pt. 3: 'rus' is Cyrillic anyway, and 'yat' etc. are Cyrillic, too.
Regarding the pt. 4: do you mean the training text, like for inclusion in the 'rus' training dataset? But wouldn't you want the graphics with real typeset glyphs for that, too?

Shreeshrii · 2018-03-30T05:23:52Z

Ray has trained models for individual languages (e.g. eng, rus) and also for the scripts in which various languages are written (e.g. the Latin script for English, French, German, etc.).

My suggestion was for you to use script/Cyrillic and compare the results with rus. If the letters you want to add are used in one of the other languages, they might be recognised.

Re. 4: yes, along with the training text, you also need a font that renders those glyphs correctly.

Shreeshrii · 2018-03-30T11:20:01Z

Please review the following files:

https://github.com/tesseract-ocr/langdata/tree/master/rus
https://github.com/tesseract-ocr/langdata/blob/master/rus/desired_characters

https://github.com/tesseract-ocr/langdata/blob/master/Cyrillic.unicharset

Adding these glyphs will require changes in the langdata repo for rus, e.g. adding them to the desired_characters file.
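
To see which of the requested glyphs are absent from a local copy of one of the files above, a plain substring check is enough. This is a minimal sketch: it assumes the file lists characters literally (not as escaped code points), and the function names are hypothetical.

```python
from pathlib import Path

# yat, fita, izhitsa in capital and small form
WANTED = "\u0462\u0463\u0472\u0473\u0474\u0475"


def missing_glyphs(file_text, wanted=WANTED):
    """Return the wanted glyphs that do not appear anywhere in file_text."""
    return [ch for ch in wanted if ch not in file_text]


def report(path):
    """Print the code points missing from a local copy of the file."""
    text = Path(path).read_text(encoding="utf-8")
    for ch in missing_glyphs(text):
        print(f"U+{ord(ch):04X} is missing from {path}")
```

For example, `report("desired_characters")` could be run on a downloaded copy of the rus file, and the same call works for `Cyrillic.unicharset`.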

maxirmx · 2020-10-08T12:02:17Z

Does anybody know of any progress on this subject, i.e. Old Russian support for tesseract?

stweil · 2020-10-08T13:14:06Z

@maxirmx, maybe you can contribute by reviewing the files named above?

maxirmx · 2020-10-27T16:24:45Z

@stweil, thank you.
https://github.com/tesseract-ocr/langdata/tree/master/rus is 'modern Russian'.
I was asking about older Russian, which included three letters that were made obsolete in 1917/1918. They were mentioned at the start of this thread: 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475).
I would expect additional complications as well, such as a different paragraph sign and the different fonts used at that time.

It is somewhat clear what to do, but I do not want to repeat others' work that might already have been done.

stweil · 2020-10-27T16:52:34Z

Okay, "Ѣ" and maybe the other older glyphs are also missing in https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Cyrillic/Cyrillic.unicharset.

So you will need ground truth data to train a new model based on rus.traineddata or Cyrillic.traineddata, but with the additional glyphs. As soon as you have line images with text transcriptions, this process is supported pretty well with tesstrain.
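
tesstrain expects each line image to sit next to a transcription file with the same base name and the extension `.gt.txt`. The sketch below checks a ground truth directory for unpaired files before training; the function name is hypothetical, and the image suffixes are limited to `.png` and `.tif` as an assumption.

```python
from pathlib import Path

IMAGE_SUFFIXES = (".png", ".tif")  # assumed image formats for line images


def unpaired_lines(ground_truth_dir):
    """Return (images without transcription, transcriptions without image)."""
    root = Path(ground_truth_dir)
    # Base names of line images, e.g. "line1" for "line1.png".
    images = {p.stem for p in root.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES}
    # Base names of transcriptions, e.g. "line1" for "line1.gt.txt".
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Both returned lists should be empty before starting a training run.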

stweil · 2020-10-27T17:03:50Z

See also issue tesseract-ocr/langdata_lstm#3 which looks like a duplicate. Maybe you can join efforts.

dvrogozh · 2020-12-25T05:15:04Z

@stweil are there any requirements for the training words/text (apart from the aforementioned 150 lines)? For example, how many times should each new character occur in the training set? Should there be at least one capital and one lowercase instance of each letter, or something like that?

Arbitrary text(s) in old russian can be obtained, for example, from ru.wikisource.org. For example, https://ru.wikisource.org/wiki/%D0%91%D0%BE%D0%B6%D0%B5%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BA%D0%BE%D0%BC%D0%B5%D0%B4%D0%B8%D1%8F_(%D0%94%D0%B0%D0%BD%D1%82%D0%B5;_%D0%9C%D0%B8%D0%BD)/%D0%94%D0%9E.

yurytch · 2020-12-25T06:45:46Z

I still fail to comprehend the process well enough.
But I guess I understand why a glyph can't be 'added' to an existing dataset -- because of how the deep learning works, right?
But retraining the complete set is rather beyond my resources, in terms of computing power and time.

Here's a thought/question:
would it be useful to train a separate (small) set consisting of those missing glyphs and glyphs that look like those missing ones? I.e. consisting of 'YAT's and 'HARD SIGN's.
Then one could use it in a set of languages:
rus+yat
Would this work at all?
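
For reference, the tesseract command line does accept several models joined with `+` in the `-l` option. The sketch below only shows how such a combination would be invoked; `yat` is a hypothetical traineddata file that does not exist, and whether the combination would actually improve recognition is the open question here.

```python
import subprocess


def tesseract_command(image_path, languages, output_base="stdout"):
    """Build a tesseract command line that loads several models at once."""
    return ["tesseract", str(image_path), output_base, "-l", "+".join(languages)]


def recognize(image_path, languages):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    cmd = tesseract_command(image_path, languages)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

With these helpers, `tesseract_command("page.png", ["rus", "yat"])` builds the equivalent of `tesseract page.png stdout -l rus+yat`.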

stweil added the enhancement label May 17, 2023
