Skip to content
This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

[Enhancement]: Propagate ocr confidence to output hocr file #86

Merged

Conversation

a-pagano
Copy link
Contributor

@a-pagano a-pagano commented Nov 30, 2017

This PR allows to parse the individual word confidence measures from Tesseract output and write them to the simplified output hocr file in the title attribute of the Box objects.

Example output: <span class="ocrx_word" title="bbox 638 1797 751 1823; x_wconf 70">Word</span>

Note: directly relates to #74 and #58 and less so to #12

Copy link
Member

@jflesch jflesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically this change is a good idea, but as done currently, I think it will break libtesseract support (and possibly Cuneiform support).

def add_word(self, word, box):
self.word_boxes.append(Box(word, box))
def add_word(self, word, box, confidence):
self.word_boxes.append(Box(word, box, confidence))
Copy link
Member

@jflesch jflesch Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_word() functions are called by pyocr.libtesseract.image_to_string(). Since you haven't updated libtesseract support, this change will break it.

I suggest you try running the tests. Make sure you have cuneiform, tesseract, libtesseract, tesseract-ocr-fra, and tesseract-ocr-jpn installed, and simply run ./run_tests.py.

Unfortunately, the outputs of the tests slightly vary based on the exact version of Tesseract and Cuneiform you're using (and wind direction I guess ....). So you will have to filter the failed tests manually: ignore those that failed just because the output has slightly changed, and just focus on the ones that failed because the API broke due to your changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point! Here is what I did:

  • I've implemented the changes in libtesseract/__init__.py and libtesseract/tesseract_raw.py to support the extraction of the confidence measure using the C-API
  • The confidence parameter for the Box constructor has a default value of 0

All in all, this basically means that the confidence measure is propagated to the output hocr files for the tesseract and libtesseract interfaces and a value of 0 is used for all the words when using cuneiform. Do you think that makes sense?

I've done some manual testing with Cuneiform, Tesseract and libtesseract to verify everything was working as expected (looking at the output hocr files). There are however still many failing unit-tests when running the test suite. I must say that it's quite hard to know if it's because I broke the API or just because tesseract feels like spitting out a different output 😅

Copy link
Member

@jflesch jflesch Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know, they are a pain. Unfortunately I haven't found a better way :/
Hint: if they fail on assertEqual, it's usually that the output differs from what was expected (aka the wind is blowing north now). For this change, you ignore those tests and focus on those where an uncatched Exception has been raised.


def get_unicode_string(self):
"""
Return the string corresponding to the box, in unicode (utf8).
This string can be stored in a file as-is (see write_box_file())
and reread using read_box_file().
"""
return to_unicode("%s %d %d %d %d") % (
return to_unicode("%s %s %d %d %d %d") % (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is, it will break tesseract.CharBoxBuilder.
CharBoxBuilder correspond to a file format specific to Tesseract (configuration 'makebox'). If you modify this function, the format won't be the same than Tesseract anymore, and CharBoxBuilder.read_file() won't be able to read files written by CharBoxBuilder.write_file() anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted changes

@a-pagano
Copy link
Contributor Author

Thank for the quick review!

Copy link
Member

@jflesch jflesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking better ; almost ready for inclusion :-)
I assume you ran the tests ? (cuneiform included)

@@ -329,7 +331,7 @@ def write_file(file_descriptor, text):
def start_line(self, box):
self.built_text.append(u"")

def add_word(self, word, box):
def add_word(self, word, box, confidence=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Either make confidence optional for all add_word() or for none. But the API of the builders must be the same for all of them.
  2. I would recommend '0' as default value instead. Seems safer.

Copy link
Member

@jflesch jflesch Nov 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you missed my point 1 :-)
Either you specify a default value for confidence on all the Builders.add_word() methods (making the argument optional), or on none of them. All builders must all have the same API (see the base class builders.BaseBuilder)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry I did miss you point! The parameter has now been made optional for all add_word methods

continue
confidence = piece.split(" ")[1]
return int(confidence)
raise Exception("Invalid hocr confidence measure: %s" % title)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This will be a problem. Paperwork uses this code to write and read hOCR files. In other words, there are already a lot of people (me included) with a lot of documents/hOCR files written without the confidence.
    This code must not raise an exception if the confidence is not found. However it can display a trace instead (this is not really unexpected nor a problem --> not a warning --> logger.info).
    In other words, please do no break the compatibility (API or file formats) :-)

  2. The message is not correct. The problem here is not that the confidence is invalid. The problem is that it hasn't been found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected!

@a-pagano a-pagano force-pushed the enhancement/add-confidence-measure branch from 01b1b2c to 1475c9e Compare November 30, 2017 14:23
@a-pagano
Copy link
Contributor Author

Yes I did ran the tests. I have 8, 3 and 4 failing tests for libtesseract, tesseract and cuneiform respectively. I suspect most of the tests fail because of the comparisons of the output of the ocr with the content of the .words and .files files present in the tests/output/ folder since they do not contain the confidence measure.

@jflesch
Copy link
Member

jflesch commented Nov 30, 2017

Good point, test reference outputs will have to be updated. I'll take care of that later.

@a-pagano
Copy link
Contributor Author

a-pagano commented Nov 30, 2017

Also, what should be the default behaviour if you're parsing an hocr file that doesn't have a confidence measure. Should the Box confidence attribute be set to 0 as it's done with cuneiform?
I'm asking because with the change you requested in the __parse_confidence a trace will be logged and nothing is returned (effectively returning None) -- > the confidence attribute of the Box object is set to None --> the call to get_xml_tag breaks because it's expecting a digit in the string it's printing

@jflesch
Copy link
Member

jflesch commented Nov 30, 2017

Good point, I missed the fact that it's not returning anything.

I think 0 is fine. If it wasn't used before, we can safely assume it won't be used now.
Or we can play it even safer and set it to -1. I assume Tesseract doesn't use negative values for the confidence ? (personally, I haven't looked at the confidence scores yet)

@a-pagano
Copy link
Contributor Author

Finding any information about the confidence measure is very hard (not much in the documentation, nothing in the changelog). I could however find a related issue in a cached version of some now defunct code.google thread. It seems that some versions of tesseract (<=3.02) had negative confidence values (between 0 and -7 :/). In the more recent versions it's a number between 0 and 100 (%).

So if I think defaulting it to 0 is the safest bet.

@jflesch
Copy link
Member

jflesch commented Nov 30, 2017

Looks good to me. I'll update the tests later this week and then do a release with this new feature.
Thank you :-)

@jflesch jflesch merged commit 30182d1 into openpaperwork:master Nov 30, 2017
@a-pagano
Copy link
Contributor Author

No problem! Thank you for your work on this project, it's been super useful for a project where I work 👍

@crazzle
Copy link

crazzle commented Dec 14, 2017

Hi, is there a plan when this feature will result in a new release?
Thanks!

@jflesch
Copy link
Member

jflesch commented Dec 14, 2017

Sorry, I've been busy with personal matters (moving to another flat, etc) and I forgot to do the release :( (thanks for reminding me :).

I'll try to do it this evening (France ; GMT+1).

@crazzle
Copy link

crazzle commented Dec 14, 2017

Awesome! Thanks a lot :)

@jflesch
Copy link
Member

jflesch commented Dec 14, 2017

Done : https://pypi.python.org/pypi?:action=display&name=pyocr&version=0.5 :)

@crazzle
Copy link

crazzle commented Dec 14, 2017

Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants