add option to output <span class=ocr_word> elements to hocr #314
base: master
You mean
Thanks, I've edited the comments. I'll look into how to change this in the source code so that the changes carry over into the pull request.
Looks good, thanks. Will there be whitespace between the word spans, e.g. for html2txt, screen readers, etc.?
ocropus-hocr
previous_char_x = char_x
PN("</span>")
except:
E("Data for ocr_word elements is not available. Did you select --llocs in ocropus-gpageseg?")
s/gpageseg/rpred/
PN("</span>")
except:
E("Data for ocr_word elements is not available. Did you select --llocs in ocropus-gpageseg?")
PN(" class='ocr_line' title='%s'>"%info,text,"</span>")
Indentation looks off on GitHub.
Yes, there is whitespace between the elements. There is no whitespace between the final word span and the closing tag of the line span. Apropos formatting, is there a beautifier command that I can run the code through so that it conforms to this project's style?
I think kraken also has something like this feature.
👍
PEP8. We discussed beautifying the whole code base but decided against it at the time, because it would change every second line and make blaming harder.
Do you take into account the fact that each 'loc' is just one spot that can be at the start, middle, or end of a glyph?
Amit -- Because of this issue, the assigned break between words is the midpoint of the space between them. (This is noted in the code comments.) This ensures, or tries to, that no part of a connected component (cc) of a glyph is cut off. It does mean that the word bounding box has some extra space on either side and that each word bbox is adjacent to the next. I feel this is a good compromise, given the data available, since it can be used for retraining, cropping images of words and so forth. I'll provide a visualization later today.
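To make the midpoint rule concrete, here is a minimal sketch of how word boundaries could be derived from per-character x positions. It is not the code in this pull request: the function name `word_boxes`, the `(character, x)` input format, and the use of the line's own edges for the first and last word are all assumptions made for illustration.

```python
def word_boxes(chars, line_x0, line_x1):
    """Split a line into words and assign each a horizontal extent.

    chars   -- list of (character, x) pairs in reading order (assumed format)
    line_x0 -- left edge of the line
    line_x1 -- right edge of the line
    Returns a list of (word, x_start, x_end) tuples.
    """
    # Group consecutive non-space characters into words, remembering the
    # x position of each word's first and last character.
    words = []
    current = None
    for ch, x in chars:
        if ch.isspace():
            if current is not None:
                words.append(current)
                current = None
        elif current is None:
            current = [ch, x, x]
        else:
            current[0] += ch
            current[2] = x
    if current is not None:
        words.append(current)

    # The break between two words is the midpoint between the last position
    # of one word and the first position of the next, so adjacent boxes touch.
    boxes = []
    left = line_x0
    for i, (text, first_x, last_x) in enumerate(words):
        if i + 1 < len(words):
            right = (last_x + words[i + 1][1]) / 2.0
        else:
            right = line_x1
        boxes.append((text, left, right))
        left = right
    return boxes
```

For example, `word_boxes([("a", 10), ("b", 20), (" ", 30), ("c", 40), ("d", 50)], 0, 60)` puts the break at x = 30, halfway between the last position of the first word and the first position of the second.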
This is the corresponding plaintext output, verifying my analysis of the errors above:
For what it's worth, it's clear we could improve on this code to generate the 'true' bbox of the word by finding the smallest rectangle around all the ccs within the bbox provided by the routine offered in this pull request. If someone could recommend a library, preferably already imported by Ocropus, that does this or that would be best to modify to this purpose, I'd be happy to work on it for a future pull request.
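One possible starting point, using only NumPy, is to shrink the box to the rows and columns that actually contain ink. This is a rough sketch and not part of the pull request: `tighten_bbox` is a hypothetical helper, the image is assumed to be a binarized array with nonzero ink pixels and row 0 at the top, and the box is assumed to be given as half-open pixel ranges. It looks only at pixels inside the box, so a connected component that straddles the box edge would still be clipped; handling that would need a connected-component labelling step.

```python
import numpy as np


def tighten_bbox(binary, x0, y0, x1, y1):
    """Shrink (x0, y0, x1, y1) to the smallest box enclosing the ink inside it.

    binary -- 2D array, nonzero where there is ink (assumed convention)
    The box covers columns x0..x1-1 and rows y0..y1-1.
    Returns the input box unchanged if the region contains no ink.
    """
    region = binary[y0:y1, x0:x1]
    rows = np.flatnonzero(region.any(axis=1))  # rows containing ink
    cols = np.flatnonzero(region.any(axis=0))  # columns containing ink
    if rows.size == 0 or cols.size == 0:
        return (x0, y0, x1, y1)
    return (
        x0 + int(cols[0]),
        y0 + int(rows[0]),
        x0 + int(cols[-1]) + 1,
        y0 + int(rows[-1]) + 1,
    )
```

Applied to each word box produced by the midpoint rule, this would remove the extra space on either side of the word while leaving the text segmentation unchanged.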
… as to if they use leading or trailing edge so this is the best we can do.)
This adds a '-w' switch to ocropus-hocr, which causes it to generate <span class=ocr_word> elements containing each word's text, validly nested within the appropriate ocr_line element. It depends on the .llocs files generated by ocropus-rpred. If these are not available, or the switch is not turned on, it uses the old behaviour.
Note that the text output from ocropus-hocr with and without the -w switch may differ. In particular, leading and trailing spaces are stripped from lines when the -w switch is on, because they tend to produce poor bounding boxes.
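As an illustration of the nesting described above, the following sketch builds one line of output with word spans inside a line span. It is not the implementation in ocropus-hocr: `hocr_line` and its argument format are hypothetical, and the `bbox x0 y0 x1 y1` title follows the usual hOCR convention. Word spans are separated by a single space, with no whitespace before the closing tag of the line.

```python
from html import escape


def hocr_line(line_bbox, words):
    """Return one ocr_line span containing nested ocr_word spans.

    line_bbox -- (x0, y0, x1, y1) of the line
    words     -- list of (text, (x0, y0, x1, y1)) pairs (assumed format)
    """
    def title(bbox):
        return "bbox %d %d %d %d" % tuple(bbox)

    spans = [
        "<span class='ocr_word' title='%s'>%s</span>" % (title(bbox), escape(text))
        for text, bbox in words
    ]
    # A single space between word spans keeps text extraction readable.
    return "<span class='ocr_line' title='%s'>%s</span>" % (
        title(line_bbox),
        " ".join(spans),
    )
```

Because the word text is escaped, characters such as `&` and `<` in the recognized text do not break the markup.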