Get invalid box information for some characters when using hocr options #1146

Closed
shawnchen8255 opened this issue Sep 21, 2017 · 6 comments
Labels
duplicate output issues related output formats

Comments


shawnchen8255 commented Sep 21, 2017

Environment

  • Tesseract Version: 4.0.0
  • Commit Number:
  • Platform: Ubuntu 17.04

Current Behavior:

When generating hOCR output from the command line, some characters are reported with the bounding box (0, 0, width, height) of the entire image.
The position information for these characters is therefore invalid.
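
For anyone who wants to check their own output for this symptom, here is a minimal sketch. It assumes the box is read from the `bbox x0 y0 x1 y1` entry of an hOCR `title` attribute; the helper names and the 1000x800 page size are hypothetical and only serve as an illustration.

```python
import re

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def is_page_sized(bbox, page_width, page_height):
    """True when the box covers the whole image, i.e. (0, 0, width, height)."""
    return bbox == (0, 0, page_width, page_height)


# Hypothetical title strings for a 1000x800 image.
bad = parse_bbox("bbox 0 0 1000 800; x_wconf 69")
good = parse_bbox("bbox 120 40 310 72; x_wconf 96")
print(is_page_sized(bad, 1000, 800), is_page_sized(good, 1000, 800))
```

This only detects the symptom described above; it does not say anything about why the engine emits such boxes.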

Expected Behavior:

Suggested Fix:


vspol commented Oct 10, 2017

Same problem.
Environment: Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica, on Windows 10.

Used command line:
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" "C:...\doc.tif" "C:...\hocr" -l rus+eng -psm 1 hocr

Part of hocr-document:

<span class='ocr_line' id='line_1_78' title="bbox 509 1652 2713 1682; baseline 0 -5; x_size 27.964552; x_descenders 5.5; x_ascenders 5.8769231">
  <span class='ocrx_word' id='word_1_397' title='bbox 0 0 3610 2628; x_wconf 69'><strong>Переходник</strong></span>
  <span class='ocrx_word' id='word_1_398' title='bbox 0 0 3610 2628; x_wconf 71'><strong>РМС</strong></span>
  <span class='ocrx_word' id='word_1_399' title='bbox 509 1652 1086 1682; x_wconf 96'><strong>симметричный</strong></span>
  <span class='ocrx_word' id='word_1_400' title='bbox 0 0 3610 2628; x_wconf 95' lang='eng'><strong>400/200</strong> ...

word_1_397, word_1_398, and word_1_400 have the size of the whole page (bbox 0 0 3610 2628).
word_1_399 has the size and coordinates of five words (bbox 509 1652 1086 1682).
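
To find such words automatically, a rough sketch along these lines may help. It assumes that a word box should lie inside the box of its enclosing `ocr_line`, which is an assumption for illustration rather than a documented guarantee; real output may need a small tolerance. The class and function names are hypothetical, and the sample is a shortened copy of the fragment above.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def bbox_of(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(map(int, match.groups())) if match else None


def contains(outer, inner):
    """True when the inner box lies completely within the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


class SuspectWordFinder(HTMLParser):
    """Collects ocrx_word elements whose box is not inside the latest ocr_line box."""

    def __init__(self):
        super().__init__()
        self.line_bbox = None
        self.suspects = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        bbox = bbox_of(attr.get("title"))
        if bbox is None:
            return
        if "ocr_line" in classes:
            self.line_bbox = bbox
        elif "ocrx_word" in classes and self.line_bbox is not None:
            if not contains(self.line_bbox, bbox):
                self.suspects.append((attr.get("id"), bbox))


def find_suspect_words(hocr_text):
    finder = SuspectWordFinder()
    finder.feed(hocr_text)
    finder.close()
    return finder.suspects


# Shortened copy of the fragment above (word text omitted).
SAMPLE = """
<span class='ocr_line' id='line_1_78' title="bbox 509 1652 2713 1682; baseline 0 -5">
<span class='ocrx_word' id='word_1_397' title='bbox 0 0 3610 2628; x_wconf 69'>w</span>
<span class='ocrx_word' id='word_1_398' title='bbox 0 0 3610 2628; x_wconf 71'>w</span>
<span class='ocrx_word' id='word_1_399' title='bbox 509 1652 1086 1682; x_wconf 96'>w</span>
<span class='ocrx_word' id='word_1_400' title='bbox 0 0 3610 2628; x_wconf 95'>w</span>
"""

for word_id, bbox in find_suspect_words(SAMPLE):
    print(word_id, bbox)
```

Note that word_1_399 passes this check because its box lies inside the line box, even though it spans several words; catching that case would need a different heuristic.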

Graphical view:

The text line box has correct coordinates:
image

but the coordinates of some words are not correct:

image

image


napasa commented Feb 5, 2018

Same Problem.

@devendrasr

I am running into the same problem. How can we overcome this?


willaaam commented Sep 3, 2018

Cross-posting, since this issue is still very much relevant:

#1192 (comment)

Contributor

zdenop commented Sep 28, 2018

Can you please provide an image for testing the issue? Also, please test with the current code.

Collaborator

amitdo commented Oct 15, 2018

Same issue as #1192

@zdenop zdenop added duplicate output issues related output formats labels Oct 15, 2018
@zdenop zdenop closed this as completed Oct 15, 2018
7 participants