[Libtesseract] Reduce calls to tesseract_raw.init() #89
Comments
The problem here is that [...] So I think this change would mean changing the API in a non-backward-compatible way. The API is the same for all the modules, so it would have to be changed in all the others too.
My own patch adds an optional keyword argument, tesseract_raw_handle, to image_to_string():

-def image_to_string(image, lang=None, builder=None):
+def image_to_string(image, lang=None, builder=None, tesseract_raw_handle=None):
     if builder is None:
         builder = builders.TextBuilder()
-    handle = tesseract_raw.init(lang=lang)
+    if tesseract_raw_handle is None:
+        handle = tesseract_raw.init(lang=lang)
+    else:
+        handle = tesseract_raw_handle

     lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
     lvl_word = tesseract_raw.PageIteratorLevel.WORD

     try:
-        # XXX(Jflesch): Issue #51:
-        # Tesseract TessBaseAPIRecognize() may segfault when the target
-        # language is not available
-        clang = lang if lang else "eng"
-        for lang_item in clang.split("+"):
-            if lang_item not in tesseract_raw.get_available_languages(handle):
-                raise TesseractError(
-                    "no lang",
-                    "language {} is not available".format(lang_item)
-                )
+        if tesseract_raw_handle is None:
+            # XXX(Jflesch): Issue #51:
+            # Tesseract TessBaseAPIRecognize() may segfault when the target
+            # language is not available
+            clang = lang if lang else "eng"
+            for lang_item in clang.split("+"):
+                if lang_item not in tesseract_raw.get_available_languages(handle):
+                    raise TesseractError(
+                        "no lang",
+                        "language {} is not available".format(lang_item)
+                    )

         tesseract_raw.set_page_seg_mode(
             handle, builder.tesseract_layout
...
@@ -159,7 +163,8 @@ def image_to_string(image, lang=None, builder=None):
                 break

     finally:
-        tesseract_raw.cleanup(handle)
+        if tesseract_raw_handle is None:
+            tesseract_raw.cleanup(handle)

     return builder.get_output()

With this patch I initialize and clean up the handle myself:

from pyocr import builders, libtesseract

# `images` is an iterable of images to recognize.
tesseract_raw_handle = libtesseract.tesseract_raw.init("eng")
try:
    for image in images:
        libtesseract.image_to_string(
            image,
            lang="eng",
            builder=builders.DigitBuilder(7),
            tesseract_raw_handle=tesseract_raw_handle
        )
finally:
    # Always release the shared handle, even if recognition fails.
    libtesseract.tesseract_raw.cleanup(tesseract_raw_handle)
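
The manual init/cleanup pairing in the loop above could also be wrapped in a context manager, so the handle is always released. This is only a sketch: `shared_handle` is a hypothetical helper, not part of PyOCR, and it takes the init and cleanup callables as arguments (for libtesseract these would presumably be `tesseract_raw.init` and `tesseract_raw.cleanup`). The `tesseract_raw_handle` keyword used in the commented example exists only with the patch above.

```python
from contextlib import contextmanager


@contextmanager
def shared_handle(init_func, cleanup_func, lang="eng"):
    """Create one handle, lend it to the caller, and always clean it up.

    Hypothetical helper for illustration; not part of PyOCR.
    """
    handle = init_func(lang=lang)
    try:
        yield handle
    finally:
        # Runs even if the body of the `with` block raises.
        cleanup_func(handle)


# Intended use (assumes the patched image_to_string() shown above):
#
# with shared_handle(libtesseract.tesseract_raw.init,
#                    libtesseract.tesseract_raw.cleanup) as handle:
#     for image in images:
#         libtesseract.image_to_string(
#             image, lang="eng", tesseract_raw_handle=handle
#         )
```

Passing the two callables in, rather than importing them, keeps the helper independent of the module layout and makes it easy to try out with stand-in functions.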
Maybe add a new class-based API, like [...]
Interesting idea, but it is still a new API. I'll consider it, but for the next major version (PyOCR2 :).
A note: before we reuse the [...]
When I call image_to_string() frequently, profiling (with pstat) shows that the calls to tesseract_raw.init() take most of the CPU time. Reading the code of image_to_string(), I found that it calls init() to get a new libtesseract handle on every call. My suggestion is to cache the libtesseract handle, either in a thread-local cache or in a class, and reuse it. I expect this would make programs run faster.
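
A thread-local cache could look roughly like the sketch below. `HandleCache` is a hypothetical name, not an existing PyOCR class. It assumes the init function accepts a `lang` keyword and the cleanup function takes the handle, as in the calls made by image_to_string(), and that a libtesseract handle can safely be reused for several images within one thread.

```python
import threading


class HandleCache:
    """Per-thread cache of expensive handles, keyed by language.

    Hypothetical sketch; init_func and cleanup_func are supplied by the
    caller (for example tesseract_raw.init and tesseract_raw.cleanup).
    """

    def __init__(self, init_func, cleanup_func):
        self._init = init_func
        self._cleanup = cleanup_func
        self._local = threading.local()

    def _handles(self):
        # Each thread lazily gets its own dict of handles.
        handles = getattr(self._local, "handles", None)
        if handles is None:
            handles = self._local.handles = {}
        return handles

    def get(self, lang="eng"):
        """Return this thread's handle for `lang`, creating it once."""
        handles = self._handles()
        if lang not in handles:
            handles[lang] = self._init(lang=lang)
        return handles[lang]

    def close(self):
        """Clean up the handles owned by the calling thread only."""
        handles = self._handles()
        for handle in handles.values():
            self._cleanup(handle)
        handles.clear()
```

Each thread has to call `close()` for its own handles before it exits; the sketch does not release handles that belong to other threads.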
Thanks.