[Enhancement]: Propagate ocr confidence to output hocr file #86

a-pagano · 2017-11-30T09:26:59Z

This PR allows to parse the individual word confidence measures from Tesseract output and write them to the simplified output hocr file in the title attribute of the Box objects.

Example output: <span class="ocrx_word" title="bbox 638 1797 751 1823; x_wconf 70">Word</span>

Note: directly relates to #74 and #58 and less so to #12

jflesch

Basically this change is a good idea, but as done currently, I think it will break libtesseract support (and possibly Cuneiform support).

jflesch · 2017-11-30T09:43:45Z

src/pyocr/builders.py

-    def add_word(self, word, box):
-        self.word_boxes.append(Box(word, box))
+    def add_word(self, word, box, confidence):
+        self.word_boxes.append(Box(word, box, confidence))


add_word() functions are called by pyocr.libtesseract.image_to_string(). Since you haven't updated libtesseract support, this change will break it.

I suggest you try running the tests. Make sure you have cuneiform, tesseract, libtesseract, tesseract-ocr-fra, and tesseract-ocr-jpn installed, and simply run ./run_tests.py.

Unfortunately, the outputs of the tests slightly vary based on the exact version of Tesseract and Cuneiform you're using (and wind direction I guess ....). So you will have to filter the failed tests manually: ignore those that failed just because the output has slightly changed, and just focus on the ones that failed because the API broke due to your changes.

Fair point! Here is what I did:

I've implemented the changes in libtesseract/__init__.py and libtesseract/tesseract_raw.py to support the extraction of the confidence measure using the C-API

The confidence parameter for the Box constructor has a default value of 0

All in all, this basically means that the confidence measure is propagated to the output hocr files for the tesseract and libtesseract interfaces and a value of 0 is used for all the words when using cuneiform. Do you think that makes sense?

I've done some manual testing with Cuneiform, Tesseract and libtesseract to verify everything was working as expected (looking at the output hocr files). There are however still many failing unit-tests when running the test suite. I must say that it's quite hard to know if it's because I broke the API or just because tesseract feels like spitting out a different output 😅

Yeah, I know, they are a pain. Unfortunately I haven't found a better way :/
Hint: if they fail on assertEqual, it's usually that the output differs from what was expected (aka the wind is blowing north now). For this change, you ignore those tests and focus on those where an uncatched Exception has been raised.

jflesch · 2017-11-30T09:51:09Z

src/pyocr/builders.py


    def get_unicode_string(self):
        """
        Return the string corresponding to the box, in unicode (utf8).
        This string can be stored in a file as-is (see write_box_file())
        and reread using read_box_file().
        """
-        return to_unicode("%s %d %d %d %d") % (
+        return to_unicode("%s %s %d %d %d %d") % (


As is, it will break tesseract.CharBoxBuilder.
CharBoxBuilder correspond to a file format specific to Tesseract (configuration 'makebox'). If you modify this function, the format won't be the same than Tesseract anymore, and CharBoxBuilder.read_file() won't be able to read files written by CharBoxBuilder.write_file() anymore.

Reverted changes

a-pagano · 2017-11-30T14:03:22Z

Thank for the quick review!

jflesch

This is looking better ; almost ready for inclusion :-)
I assume you ran the tests ? (cuneiform included)

jflesch · 2017-11-30T14:06:21Z

src/pyocr/builders.py

@@ -329,7 +331,7 @@ def write_file(file_descriptor, text):
    def start_line(self, box):
        self.built_text.append(u"")

-    def add_word(self, word, box):
+    def add_word(self, word, box, confidence=None):


Either make confidence optional for all add_word() or for none. But the API of the builders must be the same for all of them.

I would recommend '0' as default value instead. Seems safer.

I think you missed my point 1 :-)
Either you specify a default value for confidence on all the Builders.add_word() methods (making the argument optional), or on none of them. All builders must all have the same API (see the base class builders.BaseBuilder)

Ah sorry I did miss you point! The parameter has now been made optional for all add_word methods

jflesch · 2017-11-30T14:10:43Z

src/pyocr/builders.py

+                continue
+            confidence = piece.split(" ")[1]
+            return int(confidence)
+        raise Exception("Invalid hocr confidence measure: %s" % title)


This will be a problem. Paperwork uses this code to write and read hOCR files. In other words, there are already a lot of people (me included) with a lot of documents/hOCR files written without the confidence.
This code must not raise an exception if the confidence is not found. However it can display a trace instead (this is not really unexpected nor a problem --> not a warning --> logger.info).
In other words, please do no break the compatibility (API or file formats) :-)

The message is not correct. The problem here is not that the confidence is invalid. The problem is that it hasn't been found.

…form will default to confidence=0)

a-pagano · 2017-11-30T14:48:43Z

Yes I did ran the tests. I have 8, 3 and 4 failing tests for libtesseract, tesseract and cuneiform respectively. I suspect most of the tests fail because of the comparisons of the output of the ocr with the content of the .words and .files files present in the tests/output/ folder since they do not contain the confidence measure.

jflesch · 2017-11-30T14:52:04Z

Good point, test reference outputs will have to be updated. I'll take care of that later.

a-pagano · 2017-11-30T14:54:43Z

Also, what should be the default behaviour if you're parsing an hocr file that doesn't have a confidence measure. Should the Box confidence attribute be set to 0 as it's done with cuneiform?
I'm asking because with the change you requested in the __parse_confidence a trace will be logged and nothing is returned (effectively returning None) -- > the confidence attribute of the Box object is set to None --> the call to get_xml_tag breaks because it's expecting a digit in the string it's printing

jflesch · 2017-11-30T15:03:02Z

Good point, I missed the fact that it's not returning anything.

I think 0 is fine. If it wasn't used before, we can safely assume it won't be used now.
Or we can play it even safer and set it to -1. I assume Tesseract doesn't use negative values for the confidence ? (personally, I haven't looked at the confidence scores yet)

a-pagano · 2017-11-30T15:21:57Z

Finding any information about the confidence measure is very hard (not much in the documentation, nothing in the changelog). I could however find a related issue in a cached version of some now defunct code.google thread. It seems that some versions of tesseract (<=3.02) had negative confidence values (between 0 and -7 :/). In the more recent versions it's a number between 0 and 100 (%).

So if I think defaulting it to 0 is the safest bet.

jflesch · 2017-11-30T15:34:45Z

Looks good to me. I'll update the tests later this week and then do a release with this new feature.
Thank you :-)

a-pagano · 2017-11-30T15:45:41Z

No problem! Thank you for your work on this project, it's been super useful for a project where I work 👍

crazzle · 2017-12-14T10:04:13Z

Hi, is there a plan when this feature will result in a new release?
Thanks!

jflesch · 2017-12-14T10:10:16Z

Sorry, I've been busy with personal matters (moving to another flat, etc) and I forgot to do the release :( (thanks for reminding me :).

I'll try to do it this evening (France ; GMT+1).

crazzle · 2017-12-14T10:12:38Z

Awesome! Thanks a lot :)

jflesch · 2017-12-14T20:51:03Z

Done : https://pypi.python.org/pypi?:action=display&name=pyocr&version=0.5 :)

crazzle · 2017-12-14T20:58:48Z

Thanks!

Propagate ocr confidence to output hocr file when using tesseract

c2460ad

jflesch suggested changes Nov 30, 2017

View reviewed changes

Add support for caclulating confidence measure in libtesseract (Cunei…

1475c9e

…form will default to confidence=0)

a-pagano force-pushed the enhancement/add-confidence-measure branch from 01b1b2c to 1475c9e Compare November 30, 2017 14:23

Log trace when confidence measure not found instead of raising exception

3ef33b3

a-pagano added 2 commits November 30, 2017 16:24

Return 0 when no confidence measure is found

57b96f0

Make confidence parameter optional in all add_word methods

ec5e511

jflesch merged commit 30182d1 into openpaperwork:master Nov 30, 2017

This was referenced Nov 30, 2017

Confidence score #58

Open

Could we get a confidence value by each word? #74

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Enhancement]: Propagate ocr confidence to output hocr file #86

[Enhancement]: Propagate ocr confidence to output hocr file #86

a-pagano commented Nov 30, 2017 •

edited

Loading

jflesch left a comment

jflesch Nov 30, 2017 •

edited

Loading

a-pagano Nov 30, 2017

jflesch Nov 30, 2017 •

edited

Loading

jflesch Nov 30, 2017

a-pagano Nov 30, 2017

a-pagano commented Nov 30, 2017

jflesch left a comment •

edited

Loading

jflesch Nov 30, 2017

jflesch Nov 30, 2017 •

edited

Loading

a-pagano Nov 30, 2017

jflesch Nov 30, 2017

a-pagano Nov 30, 2017

a-pagano commented Nov 30, 2017

jflesch commented Nov 30, 2017

a-pagano commented Nov 30, 2017 •

edited

Loading

jflesch commented Nov 30, 2017

a-pagano commented Nov 30, 2017

jflesch commented Nov 30, 2017

a-pagano commented Nov 30, 2017

crazzle commented Dec 14, 2017

jflesch commented Dec 14, 2017

crazzle commented Dec 14, 2017

jflesch commented Dec 14, 2017

crazzle commented Dec 14, 2017

[Enhancement]: Propagate ocr confidence to output hocr file #86

[Enhancement]: Propagate ocr confidence to output hocr file #86

Conversation

a-pagano commented Nov 30, 2017 • edited Loading

jflesch left a comment

Choose a reason for hiding this comment

jflesch Nov 30, 2017 • edited Loading

Choose a reason for hiding this comment

a-pagano Nov 30, 2017

Choose a reason for hiding this comment

jflesch Nov 30, 2017 • edited Loading

Choose a reason for hiding this comment

jflesch Nov 30, 2017

Choose a reason for hiding this comment

a-pagano Nov 30, 2017

Choose a reason for hiding this comment

a-pagano commented Nov 30, 2017

jflesch left a comment • edited Loading

Choose a reason for hiding this comment

jflesch Nov 30, 2017

Choose a reason for hiding this comment

jflesch Nov 30, 2017 • edited Loading

Choose a reason for hiding this comment

a-pagano Nov 30, 2017

Choose a reason for hiding this comment

jflesch Nov 30, 2017

Choose a reason for hiding this comment

a-pagano Nov 30, 2017

Choose a reason for hiding this comment

a-pagano commented Nov 30, 2017

jflesch commented Nov 30, 2017

a-pagano commented Nov 30, 2017 • edited Loading

jflesch commented Nov 30, 2017

a-pagano commented Nov 30, 2017

jflesch commented Nov 30, 2017

a-pagano commented Nov 30, 2017

crazzle commented Dec 14, 2017

jflesch commented Dec 14, 2017

crazzle commented Dec 14, 2017

jflesch commented Dec 14, 2017

crazzle commented Dec 14, 2017

a-pagano commented Nov 30, 2017 •

edited

Loading

jflesch Nov 30, 2017 •

edited

Loading

jflesch Nov 30, 2017 •

edited

Loading

jflesch left a comment •

edited

Loading

jflesch Nov 30, 2017 •

edited

Loading

a-pagano commented Nov 30, 2017 •

edited

Loading