Extended hocr #283
base: master
Big thanks to @zuphilip for all the support and to @mittagessen (https://github.com/mittagessen/kraken) for the inspiring work.
…tions to create an extended hocr file. For more informations see PR: 'Extended hocr'
fbc5323 to 6e5532a
Just a small note: You might want to split words at Unicode whitespace characters with something like … Otherwise it looks fine as the …
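The snippet from this comment did not survive on the page. As a minimal sketch of the idea (not the reviewer's original code), splitting at any Unicode whitespace could look like the following; the function name `split_words` is hypothetical:

```python
import re

def split_words(text):
    """Split a recognized line into words at any Unicode whitespace."""
    # For str patterns in Python 3, \s already matches Unicode whitespace
    # (e.g. no-break space U+00A0, em space U+2003). re.UNICODE is passed
    # explicitly so the same call also behaves this way on unicode strings
    # under Python 2.
    return [w for w in re.split(r"\s+", text, flags=re.UNICODE) if w]

print(split_words(u"Extended\u00a0hocr\u2003word boxes"))
# ['Extended', 'hocr', 'word', 'boxes']
```

Splitting only at the ASCII space would leave words joined by characters such as U+00A0 in a single word box.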
ocropy.json and extended hocr
These changes do not break any existing functionality; they add two new features: 1) json output for each line, 2) hocr output with word boxes and probabilities.
Although the added functionality could (!) replace some older code, it does not, so some calculations
are done twice.
What will the addition do?
The new code produces a *.ocropy.json file for each line,
which contains:
This information is used to produce an extended hocr file with:
(new hocr file of testpage visualized with hocrjs)
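To illustrate how per-character information can be turned into word boxes with a confidence value, here is a rough sketch. It is not the code from this PR: the input layout (a list of `(char, box, prob)` tuples), the function names, and the choice of the minimum character probability as the word confidence are all assumptions. Only the `ocrx_word` class and the `bbox` and `x_wconf` properties come from the hOCR specification.

```python
from html import escape

def group_words(chars):
    """Group per-character records into words at whitespace.

    `chars` is a list of (char, (x0, y0, x1, y1), prob) tuples. This layout
    is assumed for illustration; it is not the actual *.ocropy.json schema.
    """
    words, current = [], []
    for ch, box, prob in chars:
        if ch.isspace():
            if current:
                words.append(current)
                current = []
        else:
            current.append((ch, box, prob))
    if current:
        words.append(current)
    return words

def word_to_hocr(word):
    """Render one word as an hOCR ocrx_word span."""
    text = "".join(ch for ch, _, _ in word)
    # The word box is the union of its character boxes.
    x0 = min(box[0] for _, box, _ in word)
    y0 = min(box[1] for _, box, _ in word)
    x1 = max(box[2] for _, box, _ in word)
    y1 = max(box[3] for _, box, _ in word)
    # Assumption: word confidence = weakest character, scaled to 0..100.
    wconf = round(100 * min(prob for _, _, prob in word))
    return ("<span class='ocrx_word' title='bbox %d %d %d %d; x_wconf %d'>%s</span>"
            % (x0, y0, x1, y1, wconf, escape(text)))

chars = [
    ("H", (10, 5, 18, 20), 0.99),
    ("i", (19, 5, 22, 20), 0.90),
    (" ", (22, 5, 26, 20), 0.80),
    ("y", (26, 8, 34, 24), 0.95),
]
for word in group_words(chars):
    print(word_to_hocr(word))
```

Taking the minimum is a conservative choice; a mean or a product of the character probabilities would be equally plausible.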
How can it be started?
There are new command-line arguments:
If ocropus-gpageseg is started with
-j/--json
it will produce the first part of the *.ocropy.json file.
The following steps (ocropus-rpred, ocropus-hocr) will recognize that there is a *.ocropy.json file and will automatically work with it. However, it is also possible to suppress some of these steps individually with additional arguments:
Stops adding further information to the json-file.
Note that if this step is skipped, the extended hocr file can't be created. And
will anyway create the hocr file the old way (without probabilities, word boxes).
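The "use the json file if it exists, otherwise fall back to the old way" behaviour described above could be sketched roughly as follows. This is only an illustration, not the PR's code: the helper names and the file naming rule (line image `010001.bin.png` next to `010001.ocropy.json`) are assumptions.

```python
import json
import os

def json_path_for(line_image_path):
    """Assumed naming: drop all extensions and append '.ocropy.json'."""
    directory, name = os.path.split(line_image_path)
    stem = name.split(".", 1)[0]
    return os.path.join(directory, stem + ".ocropy.json")

def load_line_json(line_image_path):
    """Return the parsed json for a line image, or None if none exists."""
    path = json_path_for(line_image_path)
    if not os.path.exists(path):
        # No json file: the caller would produce the plain hocr output.
        return None
    with open(path) as f:
        return json.load(f)
```

A caller would check for `None` and then skip the word boxes and probabilities for that line.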
Finally, there is another parameter
-c/--charconfs
in ocropus-hocr to output the confidence of every character. Since this massively increases the amount of data, it is off by default. For usage of this feature:
Have fun and a Merry Christmas 🎄 :)