Planning
Here we can plan the next releases of Tesseract.
That will be the next release.
See also the discussion for issue #1423.
-
Issues with the "bug" label (see list here)
-
Segmentation fault when training from .lstm extracted from tessdata/eng.traineddata (see issue 1573)
-
Report a warning when the Tesseract initialisation code detects an unsupported locale setting. (See comment)
-
Segfault on using -psm 0 when using fast eng.traineddata (see issue 1167)
-
combine_lang_model does not print correct usage help (see issue 1375)
-
Insufficient error message when output file cannot be created (see issue 1424)
-
“no best words!!” on mixed language (fra+ara) items (see issue 235)
-
Tesseract creates output for missing input (see issue 1023)
-
mgr_.Init(traineddata_path.c_str()):Error:Assert failed: #1075 (see issue 1075)
-
Some images translated to text using Tesseract 4 throw an error ... (see issue 1205)
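The locale warning proposed in the list above could follow a pattern like the one below. This is only a sketch in Python for brevity; the real check would live in Tesseract's C++ initialisation code. It assumes that "C" is the only supported locale and that only LC_CTYPE and LC_NUMERIC matter, neither of which is stated on this page. The function names are hypothetical.

```python
import locale
import sys


def unsupported_locale_categories(required="C"):
    """Return {category name: current value} for categories not set to `required`.

    Assumption: only LC_CTYPE and LC_NUMERIC are relevant.
    """
    categories = {"LC_CTYPE": locale.LC_CTYPE, "LC_NUMERIC": locale.LC_NUMERIC}
    mismatches = {}
    for name, category in categories.items():
        # Passing None queries the current setting without changing it.
        current = locale.setlocale(category, None)
        if current != required:
            mismatches[name] = current
    return mismatches


def warn_if_unsupported_locale(required="C"):
    """Print one warning per mismatching category; return True if all match."""
    mismatches = unsupported_locale_categories(required)
    for name, current in mismatches.items():
        print(f"Warning: {name} is '{current}', expected '{required}'",
              file=sys.stderr)
    return not mismatches
```

A caller could run `warn_if_unsupported_locale()` before initialisation and continue either way, so the user gets a readable message instead of a failed assertion.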
-
This does not include OpenCL or the old Tesseract engine. Some recent commits already removed such code, for example API include files, so it could be good enough for 4.0.0.
-
Script for installing only selected languages from GitHub (see issue)
Depending on available resources and opinions, these suggestions will either be added to the plan for the next or a later release, or be abandoned.
-
Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version
This would make the command slower, because each file must be opened and parsed. Should it be added as --list-langs-details, or as --list-lang-details for a single language file selected by its language code?
-
tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in the tessdata directory.
-
In addition to the current proprietary format, Tesseract could also support ZIP archives (see discussion). A possible implementation using libarchive is available, but needs more testing.
-
"Training light" - Learning by doing (see issue)
-
Modify text2image to use PrepareDistortedPix() #1052
-
Fix #736 - text2image segfault on macOS
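The cost difference behind the --list-langs suggestion above can be sketched as follows. A plain listing only scans the directory, while a detailed listing has to open every file. Purely for illustration, the sketch assumes the ZIP container from the other suggestion, with hypothetical member names `version` and `lstm`; the function names are also hypothetical, and none of this reflects the current file format or command-line behaviour.

```python
import zipfile
from pathlib import Path


def list_langs(tessdata_dir):
    """Cheap listing: names only, taken from a directory scan."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def list_langs_details(tessdata_dir):
    """Slower listing: every file is opened to inspect its members."""
    details = {}
    for path in sorted(Path(tessdata_dir).glob("*.traineddata")):
        with zipfile.ZipFile(path) as archive:
            members = archive.namelist()
            # "version" and "lstm" are hypothetical member names.
            version = None
            if "version" in members:
                version = archive.read("version").decode().strip()
            has_lstm = "lstm" in members
        details[path.stem] = {"lstm": has_lstm, "version": version}
    return details
```

Keeping the detailed variant behind a separate option would leave the existing quick listing as fast as it is now.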
Tesseract 4.0 should be a full replacement for Tesseract 3.05 and have the same features when used with the old OCR engine (--oem 0). The following regressions still need verification (are they really regressions, or just features missing for LSTM?):
These features still work with the old OCR engine (--oem 0), but are missing and desired for LSTM.
-
Black list / White list (See issue). Here is a workaround.
-
hOCR font info (See comment)
Here we collect important issues and features for the release(s) following 4.0.0.
-
New LSTM-based OSD detector (see comment).
-
Remove Legacy Tesseract Engine (see issue)
-
Better Multi-language implementation for training (See comment)
-
ARM SIMD support for dot product #519
-
Using OpenMP for dot product #983
-
Issue 1353: Patch for /training/tessopt.cpp (see pull request 13)
It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).
Old wiki - no longer maintained. The pages were moved, see the new documentation.
These wiki pages are no longer maintained.
All pages were moved to tesseract-ocr/tessdoc.
The latest documentation is available at https://tesseract-ocr.github.io/.