pdfcomp: problems with inverted text that is often better in hocr. #55
Comments
If I invert the complete image via https://pinetools.com/invert-image-colors and repeat the steps, all text seems correct in Tesseract and sharp in the resulting PDF, even though the page contains both inverted and non-inverted text:
I found a workaround to get the OCR correct: Create a file tess.cfg containing
And call
The OCR text now looks fine; however, pdfcomp crashes on this result:
The new parameter Stefan Weil suggests gives the same error.
For the record
When I look at the hOCR extracted from this "array"-containing PDF, it contains the "wis-clear" part at the top right of the image twice, unfortunately both times with confidence 100. I guess ocrmypdf should already pick the best one:
There are a few things at play, and any of them could be at fault:
You can already see that these are separately recognized words: for example, the third coordinate of the first 'w' differs from that of the second. But Stefan says this is not by design, so I guess he'll change it in Tesseract. The old behaviour with tessedit_do_invert=True already produced an "array" instead of a "name" in pdf-metadata-json, which might be a sign of multiple values and could be related.
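One way to check for the duplicated words described above is to compare the bounding boxes of hOCR words that have the same text. The following is only a minimal sketch, not what ocrmypdf, Tesseract, or pdfcomp actually do: the names `Word`, `parse_title`, `iou`, and `drop_duplicates`, the 0.5 overlap threshold, and the sample `title` strings are all assumptions made for illustration. It relies only on the usual hOCR word `title` layout of `bbox x0 y0 x1 y1; x_wconf N`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels
    conf: float  # x_wconf value, 0..100


# Matches the bbox and x_wconf properties of an hOCR word title attribute.
_TITLE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+).*?x_wconf (\d+(?:\.\d+)?)")


def parse_title(text, title):
    """Build a Word from the word text and its hOCR title attribute."""
    m = _TITLE.search(title)
    if m is None:
        raise ValueError(f"no bbox/x_wconf in title: {title!r}")
    x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
    return Word(text, (x0, y0, x1, y1), float(m.group(5)))


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def drop_duplicates(words, min_iou=0.5):
    """Keep one word per group of same-text, strongly overlapping boxes.

    Higher confidence wins; on a tie the word that came first is kept,
    because sorted() is stable.
    """
    kept = []
    for w in sorted(words, key=lambda w: -w.conf):
        if any(k.text == w.text and iou(k.bbox, w.bbox) >= min_iou for k in kept):
            continue
        kept.append(w)
    return kept


# Made-up sample: "wis" recognized twice with slightly different boxes.
sample = [
    parse_title("wis", "bbox 2188 368 2244 399; x_wconf 100"),
    parse_title("wis", "bbox 2188 368 2246 399; x_wconf 100"),
    parse_title("clear", "bbox 2262 368 2340 399; x_wconf 100"),
]
unique = drop_duplicates(sample)
print([w.text for w in unique])
```

When both copies report confidence 100, as in this thread, confidence alone cannot pick a winner, so the sketch simply keeps the first one it sees.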
I couldn't get plain Tesseract to read the print/wis-clear text correctly on its own. Looking around for a solution I came across EasyOCR, which has a reputation for performing better than Tesseract at automatic segmentation and recognition. It doesn't have hOCR output, but it gives you something similar when you just follow the main README and print(result) for the languages nl and en:
[([[107, 181], [500, 181], [500, 306], [107, 306]], 'KVK', 0.9998589158058167), ([[546, 212], [659, 212], [659, 303], [546, 303]], '14', 0.998583705333999), ([[697, 209], [1079, 209], [1079, 333], [697, 333]], 'Wijziging', 0.9999568060343197), ([[2187, 323], [2264, 323], [2264, 359], [2187, 359]], 'print', 0.9999612420764422), ([[546, 359], [1337, 359], [1337, 424], [546, 424]], 'Ondernemings- en vestigingsgegevens', 0.974492532298182), ([[2188, 368], [2244, 368], [2244, 399], [2188, 399]], 'wis', 0.9999067420092292), ([[2262, 368], [2340, 368], [2340, 399], [2262, 399]], 'clear', 0.9999880581002243), ([[545, 600], [866, 600], [866, 636], [545, 636]], 'Waarom dit formulier?', 0.9340071795845962), ([[992, 600], [1373, 600], [1373, 641], [992, 641]], 'voor het doorgeven van bijvoor-', 0.7671805513896455), ([[1433, 600], [1853, 600], [1853, 641], [1433, 641]], 'Waarom het handelsregister?', 0.9457987802731944), ([[548, 638], [902, 638], [902, 677], [548, 677]], 'Met dit formulier kunt u wijzi-', 0.8107827908344889), ([[992, 638], [1384, 638], [1384, 677], [992, 677]], 'beeld veranderde kapitaalsgege-', 0.7377568067143137), ([[1433, 638], [1752, 638], [1752, 677], [1433, 677]], 'Het inschrijven van onder-', 0.9876152636674521), ([[1894, 636], [2371, 636], [2371, 677], [1894, 677]], 'Dit gedeelte wordt door KVK ingevuld.', 0.778349646798972), ([[544, 670], [922, 670], [922, 718], [544, 718]], 'gingen in de ondernemings- en', 0.9089413235356162), ([[992, 676], [1398, 676], [1398, 715], [992, 715]], 'vens of wijzigingen in de statuten:', 0.9142662651164867), ([[1432, 673], [1790, 673], [1790,
718], [1432, 718]], 'nemingen en rechtspersonen', 0.8465853524708478), ([[545, 715], [899, 715], [899, 754], [545, 754]], 'vestigingsgegevens opgeven.', 0.7804667854203973), ([[992, 712], [1359, 712], [1359, 748], [992, 748]], 'Daarvoor heeft u het formulier', 0.9333226651228688), ([[1430, 712], [1771, 712], [1771, 755], [1430, 755]], 'is verplicht op grond van de', 0.9973401912463694), ([[994, 748], [1340, 748], [1340, 790], [994, 790]], "'Wijziging vennootschaps- of", 0.9835336956859595), ([[1433, 750], [1677, 750], [1677, 789], [1433, 789]], 'Handelsregisterwet.', 0.9722827193654787), ([[1896, 750], [2124, 750], [2124, 789], [1896, 789]], 'Datum ontvangst', 0.9874414185157038), ([[545, 789], [808, 789], [808, 829], [545, 829]], 'Het kan gaan om een', 0.9728439940888028), ([[989, 789], [1368, 789], [1368, 828], [989, 828]], "rechtspersoongegevens' nodig:", 0.9486988029219867), ([[1433, 789], [1850, 789], [1850, 828], [1433, 828]], 'De gegevens die u op dit formulier', 0.9264211175173818), ([[545, 827], [715, 827], [715, 866], [545, 866]], 'wijziging van:', 0.999760229948895), ([[1430, 827], [1831, 827], [1831, 866], [1430, 866]], 'invult, worden opgenomen in het', 0.7579797971333079), ([[569, 861], [798, 861], [798, 902], [569, 902]], 'een handelsnaam;', 0.9354983770073168), ([[992, 863], [1113, 863], [1113, 902], [992, 902]], 'Vragen?', 0.9998577521748023), ([[1433, 863], [1826, 863], [1826, 903], [1433, 903]], 'Handelsregister. 
Dit is openbaar:', 0.7884380582728256), ([[569, 901], [954, 901], [954, 937], [569, 937]], 'een internetadres (www-adres);', 0.7805590652471217), ([[992, 901], [1362, 901], [1362, 937], [992, 937]], 'Kijk op KVKnl of bel de Kamer', 0.8739483574765413), ([[1433, 901], [1798, 901], [1798, 940], [1433, 940]], 'anderen kunnen uw gegevens', 0.781161744977421), ([[569, 937], [839, 937], [839, 976], [569, 976]], 'de bedrijfsactiviteiten;', 0.8728194735672185), ([[992, 936], [1384, 936], [1384, 978], [992, 978]], 'van Koophandel (KVK) als u nog', 0.9075128691602377), ([[1433, 940], [1847, 940], [1847, 978], [1433, 978]], 'natrekken en ook u kunt gegevens', 0.7539825691271265), ([[569, 973], [930, 973], [930, 1017], [569, 1017]], 'het adres of correspondentie-', 0.8615655799722887), ([[992, 974], [1368, 974], [1368, 1015], [992, 1015]], 'vragen heeft. Bijvoorbeeld over', 0.7223914006581907), ([[1433, 978], [1795, 978], [1795, 1014], [1433, 1014]], 'opvragen van ondernemingen', 0.9870294822997413), ([[568, 1012], [654, 1012], [654, 1051], [568, 1051]], 'adres;', 0.999997031312701), ([[992, 1014], [1340, 1014], [1340, 1050], [992, 1050]], 'het invullen van dit formulier', 0.9901984504096066), ([[1433, 1014], [1804, 1014], [1804, 1052], [1433, 1052]], 'waarmee u bijvoorbeeld zaken', 0.782222782261967), ([[1896, 1009], [2139, 1009], [2139, 1053], [1896, 1053]], 'Datum inschrijving', 0.7208259282818439), ([[569, 1048], [913, 1048], [913, 1088], [569, 1088]], 'het telefoon-, faxnummer of', 0.9877933221861724), ([[1434, 1053], [1556, 1053], [1556, 1084], [1434, 1084]], 'wilt doen.', 0.7609010399097881), ([[569, 1088], [726, 1088], [726, 1124], [569, 1124]], 'e-mailadres;', 0.9898517615830904), ([[992, 1090], [1354, 1090], [1354, 1126], [992, 1126]], 'Als u een vergissing maakt bij', 0.7862420307167383), ([[1432, 1084], [1832, 1084], [1832, 1133], [1432, 1133]], 'Zo draagt het Handelsregister bij', 0.8588092579614046), ([[569, 1126], [954, 1126], [954, 1162], [569, 1162]], 'het 
aantal werkzame personen;', 0.9940783615503822), ([[992, 1126], [1387, 1126], [1387, 1162], [992, 1162]], 'het invullen; dan kunt u het foute', 0.8983763846424127), ([[1430, 1126], [1688, 1126], [1688, 1162], [1430, 1162]], 'tot zeker zakendoen:', 0.7060427266745706), ([[569, 1159], [907, 1159], [907, 1204], [569, 1204]], 'opheffing of overdracht van', 0.9978238227742723), ([[993, 1165], [1320, 1165], [1320, 1197], [993, 1197]], 'antwoord doorhalen en het', 0.9739850544014607), ([[1897, 1165], [2074, 1165], [2074, 1197], [1897, 1197]], 'KVK-nummer', 0.9765328524034846), ([[569, 1199], [921, 1199], [921, 1242], [569, 1242]], 'de onderneming of vestiging:', 0.8827611291745454), ([[990, 1201], [1335, 1201], [1335, 1243], [990, 1243]], 'goede antwoord erbij zetten:', 0.8367759105473107), ([[545, 1236], [957, 1236], [957, 1278], [545, 1278]], 'U kunt dit formulier niet gebruiken', 0.9947663851384315), ([[991, 1236], [1404, 1236], [1404, 1278], [991, 1278]], 'Plaats hierbij wel uw handtekening:', 0.7712697939109996), ([[552, 1336], [573, 1336], [573, 1368], [552, 1368]], '1', 0.9998362131432401), ([[621, 1328], [1075, 1328], [1075, 1385], [621, 1385]], 'Inschrijfgegevens bij KVK', 0.7913680104989015), ([[112, 1416], [343, 1416], [343, 1471], [112, 1471]], 'Toelichting 1.1', 0.9665197442435908), ([[546, 1429], [585, 1429], [585, 1460], [546, 1460]], '1.1', 0.9101260751881599), ([[698, 1425], [789, 1425], [789, 1461], [698, 1461]], 'welke', 0.999977395675486), ([[787, 1416], [1707, 1416], [1707, 1472], [787, 1472]], 'onderneming of rechtspersoon wordt de wijziging opgegeven?', 0.8649164762631084), ([[114, 1461], [475, 1461], [475, 1506], [114, 1506]], 'Om de gewijzigde gegevens', 0.9945734635776975), ([[112, 1498], [477, 1498], [477, 1540], [112, 1540]], 'te kunnen doorvoeren, heeft', 0.8308951210369384), ([[113, 1532], [425, 1532], [425, 1581], [113, 1581]], 'KVK de gegevens nodig', 0.8539477367401248), ([[623, 1541], [709, 1541], [709, 1572], [623, 1572]], 'naam', 
0.9998448491096497), ([[111, 1570], [473, 1570], [473, 1619], [111, 1619]], 'waaronder de onderneming', 0.9175567179099927), ([[112, 1612], [480, 1612], [480, 1655], [112, 1655]], 'of rechtspersoon staat inge-', 0.9706024251224752), ([[113, 1653], [245, 1653], [245, 1685], [113, 1685]], 'schreven:', 0.9995520407921188), ([[112, 1688], [452, 1688], [452, 1727], [112, 1727]], 'de naam; plaats van vesti-', 0.7291726366433087), ([[112, 1726], [436, 1726], [436, 1765], [112, 1765]], 'ging en inschrijfnummer.', 0.6917421918692422), ([[618, 1718], [922, 1718], [922, 1772], [618, 1772]], 'plaats van vestiging', 0.9094052735610817), ([[619, 1836], [984, 1836], [984, 1880], [619, 1880]], 'inschrijfnummer bij KVK', 0.8361044693870214), ([[549, 1933], [577, 1933], [577, 1970], [549, 1970]], '2', 1.0), ([[620, 1924], [887, 1924], [887, 1984], [620, 1984]], 'Soort wijziging', 0.9432145475493581), ([[112, 2016], [343, 2016], [343, 2071], [112, 2071]], 'Toelichting 2.1', 0.8589677715582387), ([[545, 2025], [589, 2025], [589, 2061], [545, 2061]], '2.1', 0.3331562578678131), ([[621, 2021], [982, 2021], [982, 2071], [621, 2071]], 'De wijzigingen betreffen', 0.6568526212805448), ([[114, 2062], [403, 2062], [403, 2102], [114, 2102]], 'U kunt op dit formulier', 0.7075532554106072), ([[114, 2102], [447, 2102], [447, 2141], [114, 2141]], 'wijzigingen opgeven in de', 0.9030469818915413), ([[657, 2099], [1236, 2099], [1236, 2144], [657, 2144]], 'de hoofdvestiging of de enige vestiging', 0.9419140625236007), ([[112, 2138], [477, 2138], [477, 2179], [112, 2179]], 'gegevens van één vestiging:', 0.8397015077006383), ([[131, 2176], [436, 2176], [436, 2215], [131, 2215]], 'de hoofdvestiging of de', 0.919350472671281), ([[697, 2172], [1010, 2172], [1010, 2221], [697, 2221]], 'één andere vestiging', 0.998114945881038), ([[128, 2214], [334, 2214], [334, 2253], [128, 2253]], 'enige vestiging;', 0.7957931561611861), ([[128, 2253], [310, 2253], [310, 2292], [128, 2292]], 'één vestiging:', 
0.831584808084107), ([[744, 2246], [1201, 2246], [1201, 2295], [744, 2295]], 'het adres van deze vestiging is', 0.8176983426178079), ([[546, 2478], [594, 2478], [594, 2510], [546, 2510]], '2.2', 0.8481187224388123), ([[618, 2471], [2145, 2471], [2145, 2520], [618, 2520]], 'Kruis hier aan wat er is gewijzigd en ga door naar de aangegeven vraag (meerdere antwoorden mogelijk)', 0.7809722471670952), ([[657, 2547], [1130, 2547], [1130, 2592], [657, 2592]], 'handelsnaam of handelsnamen', 0.9969339882359225), ([[1496, 2551], [1708, 2551], [1708, 2590], [1496, 2590]], 'Ga naar vraag 3', 0.9953507898783989), ([[656, 2620], [1317, 2620], [1317, 2668], [656, 2668]], 'bedrijfsactiviteiten; diensten en/of producten', 0.7978056315421745), ([[1496, 2628], [1708, 2628], [1708, 2664], [1496, 2664]], 'Ga naar vraag 4', 0.7169601715113014), ([[656, 2697], [1435, 2697], [1435, 2745], [656, 2745]], 'activiteiten van een rechtspersoon zonder onderneming', 0.9387298802978001), ([[1496, 2702], [1708, 2702], [1708, 2741], [1496, 2741]], 'Ga naar vraag 4', 0.9985427968075652), ([[656, 2769], [1164, 2769], [1164, 2818], [656, 2818]], 'adres en/of correspondentieadres', 0.889592751275233), ([[1496, 2779], [1710, 2779], [1710, 2815], [1496, 2815]], 'Ga naar vraag 5', 0.9997755335174974), ([[656, 2845], [1060, 2845], [1060, 2894], [656, 2894]], 'internetadres (www-adres)', 0.7284716366023437), ([[1496, 2853], [1708, 2853], [1708, 2892], [1496, 2892]], 'Ga naar vraag 6', 0.9986732443390018), ([[655, 2919], [1429, 2919], [1429, 2969], [655, 2969]], 'telefoon-, faxnummer; e-mailadres; berichtenboxnaam', 0.8868889771133504), ([[1496, 2927], [1686, 2927], [1686, 2966], [1496, 2966]], 'Ga naar vraag', 0.9976082940935516), ([[657, 2998], [1061, 2998], [1061, 3040], [657, 3040]], 'aantal werkzame personen', 0.8325593660182602), ([[1496, 3004], [1708, 3004], [1708, 3040], [1496, 3040]], 'Ga naar vraag 8', 0.7877077994100569), ([[656, 3069], [1010, 3069], [1010, 3121], [656, 3121]], 'opheffing of 
overdracht', 0.9805956245628613), ([[1496, 3078], [1708, 3078], [1708, 3117], [1496, 3117]], 'Ga naar vraag 9', 0.9980991381753203), ([[115, 3408], [901, 3408], [901, 3440], [115, 3440]], 'Kamer van Koophandel@ juni 2020 Wijziging ondernemings- en vestigingsgegevens', 0.7441849892761688), ([[2278, 3408], [2326, 3408], [2326, 3436], [2278, 3436]], 'blad', 0.9999856948852539), ([[2342, 3414], [2396, 3414], [2396, 3435], [2342, 3435]], 'van 4', 0.9974790653868143), ([[624.0298574998546, 1419.1194299994186], [700.8096965887974, 1429.7808970915848], [694.9701425001454, 1466.8805700005814], [618.1903034112026, 1456.2191029084152]], 'Voor', 0.9999222159385681)]
At first sight only the @ is wrong; it should be the R of "registered" (®).
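Each entry in the printed result is a tuple of four corner points, the recognized text, and a confidence between 0 and 1. As a rough sketch of how such entries could be turned into something closer to hOCR, the snippet below collapses each quadrilateral into an axis-aligned box and emits an hOCR-style span. The helper names `quad_to_bbox` and `to_hocr_word` are hypothetical, and using the `ocrx_word` class with `x_wconf` for fragments that may contain several words is a simplification, not a conforming hOCR export. Only entries copied from the output above are used as input.

```python
import math
from html import escape


def quad_to_bbox(quad):
    """Smallest axis-aligned integer box around a 4-point quadrilateral."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))


def to_hocr_word(entry, word_id):
    """Format one (quad, text, confidence) entry as an hOCR-style span."""
    quad, text, conf = entry
    x0, y0, x1, y1 = quad_to_bbox(quad)
    wconf = round(conf * 100)  # scale 0..1 confidence to a 0..100 integer
    return (
        f"<span class='ocrx_word' id='word_{word_id}' "
        f"title='bbox {x0} {y0} {x1} {y1}; x_wconf {wconf}'>"
        f"{escape(text)}</span>"
    )


# Three entries taken from the result printed above.
result = [
    ([[107, 181], [500, 181], [500, 306], [107, 306]], 'KVK', 0.9998589158058167),
    ([[2188, 368], [2244, 368], [2244, 399], [2188, 399]], 'wis', 0.9999067420092292),
    ([[624.0298574998546, 1419.1194299994186], [700.8096965887974, 1429.7808970915848],
      [694.9701425001454, 1466.8805700005814], [618.1903034112026, 1456.2191029084152]],
     'Voor', 0.9999222159385681),
]

spans = [to_hocr_word(entry, i) for i, entry in enumerate(result, start=1)]
for span in spans:
    print(span)
```

The last entry ('Voor') is a slightly rotated quadrilateral with float coordinates, so its box is rounded outwards to whole pixels; the rotation itself is lost in this conversion.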
While playing around with the new You.com YouChat (free to use at the moment), which answers questions ChatGPT-style but includes references and actual web search results, I found this article on document segmentation:
Right, the hOCR results basically contain the results of the Tesseract segmentation, so we wouldn't have to redo that.
This form https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf
First page saved to jpeg via this site: https://smallpdf.com
The result for the left column is quite readable at the right screen resolution.
formulierhocrjpgkleiner.pdf
It contains unreadable text on the left. The hOCR contains "Toelichting 1.1", but in the PDF it is completely unreadable.
My patch for the inversion ratio makes it more readable:
formulierhocrjpgkleinerpatch.pdf
However, if you look at the mask picture, it doesn't contain the text in the left column at all.
So my patch isn't the only change that routine needs.