-
Notifications
You must be signed in to change notification settings - Fork 152
[Enhancement]: Propagate ocr confidence to output hocr file #86
[Enhancement]: Propagate ocr confidence to output hocr file #86
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically this change is a good idea, but as done currently, I think it will break libtesseract support (and possibly Cuneiform support).
src/pyocr/builders.py
Outdated
def add_word(self, word, box): | ||
self.word_boxes.append(Box(word, box)) | ||
def add_word(self, word, box, confidence): | ||
self.word_boxes.append(Box(word, box, confidence)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add_word()
functions are called by pyocr.libtesseract.image_to_string()
. Since you haven't updated libtesseract support, this change will break it.
I suggest you try running the tests. Make sure you have cuneiform
, tesseract
, libtesseract
, tesseract-ocr-fra
, and tesseract-ocr-jpn
installed, and simply run ./run_tests.py
.
Unfortunately, the outputs of the tests slightly vary based on the exact version of Tesseract and Cuneiform you're using (and wind direction I guess ....). So you will have to filter the failed tests manually: ignore those that failed just because the output has slightly changed, and just focus on the ones that failed because the API broke due to your changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point! Here is what I did:
- I've implemented the changes in
libtesseract/__init__.py
andlibtesseract/tesseract_raw.py
to support the extraction of the confidence measure using the C-API - The confidence parameter for the Box constructor has a default value of 0
All in all, this basically means that the confidence measure is propagated to the output hocr files for the tesseract and libtesseract interfaces and a value of 0 is used for all the words when using cuneiform. Do you think that makes sense?
I've done some manual testing with Cuneiform, Tesseract and libtesseract to verify everything was working as expected (looking at the output hocr files). There are however still many failing unit-tests when running the test suite. I must say that it's quite hard to know if it's because I broke the API or just because tesseract feels like spitting out a different output 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I know, they are a pain. Unfortunately I haven't found a better way :/
Hint: if they fail on assertEqual
, it's usually that the output differs from what was expected (aka the wind is blowing north now). For this change, you ignore those tests and focus on those where an uncatched Exception has been raised.
src/pyocr/builders.py
Outdated
|
||
def get_unicode_string(self): | ||
""" | ||
Return the string corresponding to the box, in unicode (utf8). | ||
This string can be stored in a file as-is (see write_box_file()) | ||
and reread using read_box_file(). | ||
""" | ||
return to_unicode("%s %d %d %d %d") % ( | ||
return to_unicode("%s %s %d %d %d %d") % ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As is, it will break tesseract.CharBoxBuilder
.
CharBoxBuilder
correspond to a file format specific to Tesseract (configuration 'makebox'). If you modify this function, the format won't be the same than Tesseract anymore, and CharBoxBuilder.read_file()
won't be able to read files written by CharBoxBuilder.write_file()
anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted changes
Thank for the quick review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking better ; almost ready for inclusion :-)
I assume you ran the tests ? (cuneiform included)
src/pyocr/builders.py
Outdated
@@ -329,7 +331,7 @@ def write_file(file_descriptor, text): | |||
def start_line(self, box): | |||
self.built_text.append(u"") | |||
|
|||
def add_word(self, word, box): | |||
def add_word(self, word, box, confidence=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Either make
confidence
optional for alladd_word()
or for none. But the API of the builders must be the same for all of them. - I would recommend '0' as default value instead. Seems safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you missed my point 1 :-)
Either you specify a default value for confidence
on all the Builders.add_word() methods (making the argument optional), or on none of them. All builders must all have the same API (see the base class builders.BaseBuilder
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry I did miss you point! The parameter has now been made optional for all add_word
methods
src/pyocr/builders.py
Outdated
continue | ||
confidence = piece.split(" ")[1] | ||
return int(confidence) | ||
raise Exception("Invalid hocr confidence measure: %s" % title) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
This will be a problem. Paperwork uses this code to write and read hOCR files. In other words, there are already a lot of people (me included) with a lot of documents/hOCR files written without the confidence.
This code must not raise an exception if the confidence is not found. However it can display a trace instead (this is not really unexpected nor a problem --> not a warning -->logger.info
).
In other words, please do no break the compatibility (API or file formats) :-) -
The message is not correct. The problem here is not that the confidence is invalid. The problem is that it hasn't been found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected!
…form will default to confidence=0)
01b1b2c
to
1475c9e
Compare
Yes I did ran the tests. I have 8, 3 and 4 failing tests for libtesseract, tesseract and cuneiform respectively. I suspect most of the tests fail because of the comparisons of the output of the ocr with the content of the |
Good point, test reference outputs will have to be updated. I'll take care of that later. |
Also, what should be the default behaviour if you're parsing an hocr file that doesn't have a confidence measure. Should the Box confidence attribute be set to 0 as it's done with cuneiform? |
Good point, I missed the fact that it's not returning anything. I think 0 is fine. If it wasn't used before, we can safely assume it won't be used now. |
Finding any information about the confidence measure is very hard (not much in the documentation, nothing in the changelog). I could however find a related issue in a cached version of some now defunct code.google thread. It seems that some versions of tesseract (<=3.02) had negative confidence values (between 0 and -7 :/). In the more recent versions it's a number between 0 and 100 (%). So if I think defaulting it to 0 is the safest bet. |
Looks good to me. I'll update the tests later this week and then do a release with this new feature. |
No problem! Thank you for your work on this project, it's been super useful for a project where I work 👍 |
Hi, is there a plan when this feature will result in a new release? |
Sorry, I've been busy with personal matters (moving to another flat, etc) and I forgot to do the release :( (thanks for reminding me :). I'll try to do it this evening (France ; GMT+1). |
Awesome! Thanks a lot :) |
Thanks! |
This PR allows to parse the individual word confidence measures from Tesseract output and write them to the simplified output hocr file in the title attribute of the Box objects.
Example output:
<span class="ocrx_word" title="bbox 638 1797 751 1823; x_wconf 70">Word</span>
Note: directly relates to #74 and #58 and less so to #12