Switch to NFC normalisation by default #257

nickjwhite · 2017-11-01T08:27:21Z

Switch from NFKC normalisation to NFC normalisation by default. NFC
normalisation is more appropriate for OCR, as different characters
which may be semantically similar are nevertheless often useful to
capture and output in their original form.

Switch from NFKC normalisation to NFC normalisation by default. NFC normalisation is more appropriate for OCR, as different characters which may be semantically similar are nevertheless often useful to capture and output in their original form.

kba · 2017-12-08T16:10:30Z

Can you elaborate on why NFKC is better suited for OCR than NFC? Can you give example data where NFKC is superior and ideally test data for CI? How will this influence recognition with the widely used en-default and fraktur models?

I'm reluctant to merge this until I fully understand the consequences.

nickjwhite · 2017-12-12T20:33:20Z

Thanks for commenting @kba.

My use case for NFC is to recognise and differentiate long s (ſ) from a short s (s) in old Latin documents. NFKC recognises that they're semantically interchangable, and so changes the long s into a short s, meaning that I can't differentiate them in the OCR output. I realised this as I found that the long s I had in my codec input text wasn't included in the codec debug output, but it also meant that the long s characters in my ground truth were silently changed to short s characters.

It makes sense to me to only normalise down to glyphs that are identical, and not that are deemed by Unicode to be "equivalent", so that other such characters can be differentiated. To use the example above of long and short s, with this patch one could still just transcribe all long s characters in ground truth with a short s if the differentiation wasn't important, the difference is that if it is relevant then the different glyphs can be preserved and correctly represented with an appropriate model.

I can't think of any cases where this could cause a regression in other training qualities, unless I suppose the ground truth files depended on using different unicode characters that would be normalised to the same character when training. I am not expert in non-Latin or Greek scripts, so perhaps that could be an issue, but that would surprise me.

I hope this all makes sense. Do ping me for clarification if I have been too verbose and unclear!

(edited as I got NFKC and NFC the wrong way around in this comment initially - sorry!)

nickjwhite · 2017-12-12T21:40:46Z

I just tested all the ground truth that was linked to from the wiki, comparing differences between NFC and NFKC: https://github.com/tmbdev/ocropy/wiki/Models

https://github.com/ChillarAnand/likitham (Telugu): no difference
https://github.com/jze/ocropus-model_fraktur (Fraktur): no difference
https://github.com/jze/ocropus-model_oesterreich-ungarn (German): no difference

https://github.com/zuphilip/ocropy-french-models (French): Only difference is several instances of this:
NFC: …
NFKC: ...

https://github.com/jze/ocropus-model_cyrillic (Cyrillic): 1 instance of a difference
NFC: ¾
NFKC: 3⁄4

https://github.com/isaomatsunami/clstm-Japanese (Japanese): many differences, but all small variants of which these seem representative:
NFC: 膀紛惟賢筋紬鉱確板藪垢ト木撰詠腹蒋〉筋静ぞ⑬端宥罷橿懇培上鋼
NFKC: 膀紛惟賢筋紬鉱確板藪垢ト木撰詠腹蒋〉筋静ぞ13端宥罷橿懇培上鋼

NFC: 盗紺r貧双員急媚牙鳥忠傑勃或架届循諭繁敢想疫別緯容く茶ｂ游髙
NFKC: 盗紺r貧双員急媚牙鳥忠傑勃或架届循諭繁敢想疫別緯容く茶b游髙

NFC: 染し璧ク偏郁承枚莉析蒋呪拡霊繊覚産⓯＋剝匿節￥司借垂彌備貴菊
NFKC: 染し璧ク偏郁承枚莉析蒋呪拡霊繊覚産⓯+剝匿節¥司借垂彌備貴菊

NFC: 仰佛ル祖留器碕膚蛛納准嚥廠㍑抑抜提騙ん禍駐尽陵繭弩未汰責脳商
NFKC: 仰佛ル祖留器碕膚蛛納准嚥廠リットル抑抜提騙ん禍駐尽陵繭弩未汰責脳商

I also checked the text in ocropy's tests/ directory, and there was no difference between NFC and NFKC.

From looking at all of these, it still seems to me that NFC is the best option, as it follows the principle of least surprise: it will ensure that whatever glyph is encoded in the ground truth will be used for the model. I suspect that the people creating these models didn't expect the OCR to alter their characters from the ground truth the way NFKC does.

I couldn't see the ground truth for the English and Fraktur models you mention, but I'd be happy to compare them too if they're available.

amitdo · 2017-12-13T09:16:48Z

@nickjwhite

About 'long-s':
https://www.mail-archive.com/[email protected]/msg01569.html

Beckenb · 2017-12-13T11:08:10Z

About 'long-s':
https://www.mail-archive.com/[email protected]/msg01569.html

I would recommend just recognizing it with the default Fraktur model and
choosing long/short s based on context; there are very few cases where the
choice can't be made programmatically, and you should be able to find those
with a simple script.

Tom is proposing to change "short s" to "long s" after the ocr; while this may work relatively easy on more recent fraktur texts (eg. 19th century) the older the texts get the more difficult this is; for example incunabula (1450-1500) have no uniform grammar or spelling to follow, so it is crucial that the ocr reflects the print as closely as possible.

nickjwhite · 2017-12-13T11:49:10Z

@Beckenb is correct, and moreover capturing places where long s is used in a way that is not "correct" can be useful information to capture, in some cases.

There will also be cases of other characters for which it is important to recognise the particular glyph, even if it is considered "equivalent" to a different one by unicode. Long s is just a well-known, obvious example.

amitdo · 2017-12-13T15:28:59Z

@nickjwhite,

Maybe you want to explore what Tesseract 4.00 is doing.
https://github.com/tesseract-ocr/tesseract/search?q=UnicodeNormMode

mittagessen · 2018-01-07T11:11:11Z

I want to throw in NFD as the default mode as well and strongly agree that any of the canonical normalizations are too lossy for OCR on historical prints. NFD has the benefit over NFC that codecs are smaller resulting in a slight boost in recognition accuracy (<1%) and speed (probably unnoticable). For highly diacritized (?) scripts such as polytonic Greek the codec size will be roughly cut in half.

The only drawback is that polytonic Greek output will look worse on displays as the glyphs in most fonts are only defined for the combined code points.

Switch to NFC normalisation by default

839467d

Switch from NFKC normalisation to NFC normalisation by default. NFC normalisation is more appropriate for OCR, as different characters which may be semantically similar are nevertheless often useful to capture and output in their original form.

kba added the 🗣️ discussion label Dec 8, 2017

kba mentioned this pull request Dec 8, 2017

Handle combining characters correctly in a codec #261

Open

nickjwhite mentioned this pull request Aug 31, 2018

Add NFC and NFD normalization options (keeping NFKC as the default) #313

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to NFC normalisation by default #257

Switch to NFC normalisation by default #257

nickjwhite commented Nov 1, 2017

kba commented Dec 8, 2017

nickjwhite commented Dec 12, 2017 •

edited

Loading

nickjwhite commented Dec 12, 2017 •

edited by kba

Loading

amitdo commented Dec 13, 2017

Beckenb commented Dec 13, 2017

nickjwhite commented Dec 13, 2017

amitdo commented Dec 13, 2017

mittagessen commented Jan 7, 2018

Switch to NFC normalisation by default #257

Are you sure you want to change the base?

Switch to NFC normalisation by default #257

Conversation

nickjwhite commented Nov 1, 2017

kba commented Dec 8, 2017

nickjwhite commented Dec 12, 2017 • edited Loading

nickjwhite commented Dec 12, 2017 • edited by kba Loading

amitdo commented Dec 13, 2017

Beckenb commented Dec 13, 2017

nickjwhite commented Dec 13, 2017

amitdo commented Dec 13, 2017

mittagessen commented Jan 7, 2018

nickjwhite commented Dec 12, 2017 •

edited

Loading

nickjwhite commented Dec 12, 2017 •

edited by kba

Loading