Building from source using autotools - config files not being found #2848
Comments
Linux: In |
@amitdo Thanks. You are right, the files are there in Then why do errors such as the following appear when running tesseract?
A google search shows multiple issues related to this. Does this mean that |
Which command shows this error? Without full details there is no good reason to post an issue. |
User report ... tesseract 5.0.0-alpha-582-g60b07
generates a txt file instead of a box file and also gives this message: read_params_file: cant open wordstrbox |
Did you set |
Shree, here's something I found So your answers were similar to mine :) |
Thanks, @amitdo . Yes, it is probably related to there being another older tessdata directory somewhere on the user's system. When the autotools build and install is done and tesseract is run after that, it should find the tessdata and configs installed by the build. Does the build set a default value for TESSDATA_PREFIX? What is the order of directories checked for tessdata files? |
The "magic" is done here: |
Is |
|
I think the wiki instructions need some changes. While the config files are installed by the build process in /usr/local/share/tessdata, none of the traineddata files are installed. Tesseract won't work without osd.traineddata and eng.traineddata.
However, using a TESSDATA_PREFIX directory that has only eng.traineddata but not the config files will lead to errors if users try to use any of the config files, e.g. tsv, hocr, pdf, wordstrbox etc. |
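The mismatch described above can be checked before running tesseract. The following is a minimal sketch, not part of Tesseract: the helper name `missing_tessdata_files` is hypothetical, and it assumes that named config files such as wordstrbox sit in a `configs` subdirectory of the tessdata directory, next to the `*.traineddata` files.

```python
from pathlib import Path


def missing_tessdata_files(tessdata_dir, langs=("eng", "osd"), configs=("wordstrbox",)):
    """Return the expected files that are absent from tessdata_dir.

    Hypothetical helper for illustration only. It assumes the layout
    <tessdata_dir>/<lang>.traineddata and <tessdata_dir>/configs/<name>.
    """
    root = Path(tessdata_dir)
    # One traineddata file per requested language.
    expected = [root / f"{lang}.traineddata" for lang in langs]
    # Assumption: config files live in the "configs" subdirectory.
    expected += [root / "configs" / name for name in configs]
    # Keep only the paths that do not exist as regular files.
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means both the traineddata files and the requested configs are in the same directory; a non-empty list names exactly what a user would still need to copy or download there.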
Please see if the following instructions are correct and will solve this problem.
There are two parts to install for Tesseract, the engine itself, and the training data for a language. In addition to these, traineddata for a language is needed to recognize the text in images. Three types of traineddata files for over 130 languages and over 35 scripts are available in tesseract-ocr GitHub repos. When building from source on Linux, the tessdata configs will be installed in
If you want support for both the legacy (--oem 0) and LSTM (--oem 1) engines, download the traineddata from https://github.com/tesseract-ocr/tessdata. Use traineddata from https://github.com/tesseract-ocr/tessdata_best or traineddata from https://github.com/tesseract-ocr/tessdata_fast if you only want support for the LSTM engine (--oem 1). Please make sure to use the download link or wget the raw file. |
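As a rough illustration of "wget the raw file", here is a sketch in Python. The function names are hypothetical, and the branch name `main` is an assumption that should be checked against the repository. The point is only that the `/raw/` form of the URL returns the file itself, whereas the normal repository page returns HTML.

```python
import urllib.request
from pathlib import Path


def raw_traineddata_url(lang, repo="tessdata_fast", branch="main"):
    """Build the raw-file URL for <lang>.traineddata.

    Assumption: `branch` names the repository's default branch; verify it
    before use. `repo` is one of tessdata, tessdata_best, tessdata_fast.
    """
    return f"https://github.com/tesseract-ocr/{repo}/raw/{branch}/{lang}.traineddata"


def download_traineddata(lang, tessdata_dir, repo="tessdata_fast", branch="main"):
    """Download one traineddata file into tessdata_dir and return its path."""
    target = Path(tessdata_dir) / f"{lang}.traineddata"
    # urlretrieve follows redirects and writes the response body to `target`.
    urllib.request.urlretrieve(raw_traineddata_url(lang, repo, branch), str(target))
    return target
```

Downloading into the directory that already holds the installed configs (for example `download_traineddata("eng", "/usr/local/share/tessdata")`, which may need write permission) keeps the traineddata and config files together.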
My 2 cents: For more advanced users, PR #2459 should be the solution. |
It's only needed if you want to put the traineddata in a different directory than the directory that was defined during installation. As I said, you can change the default path with the --prefix parameter. |
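The relationship between the install prefix and TESSDATA_PREFIX can be sketched as follows. This is a simplification for illustration, not Tesseract's actual lookup code: the function name is hypothetical, the `share/tessdata` layout is taken from the /usr/local/share/tessdata path mentioned earlier in this thread, and it assumes the environment variable names the tessdata directory itself.

```python
import os
from pathlib import Path


def effective_tessdata_dir(prefix="/usr/local", environ=None):
    """Guess which tessdata directory will be used.

    Simplified, hypothetical model: TESSDATA_PREFIX wins when it is set,
    otherwise the directory derived from the install prefix is used.
    """
    env = os.environ if environ is None else environ
    override = env.get("TESSDATA_PREFIX")
    if override:
        # Assumption: the variable points at the tessdata directory itself.
        return Path(override)
    # Default derived from the --prefix given at configure time.
    return Path(prefix) / "share" / "tessdata"
```

Under this model, a user who leaves TESSDATA_PREFIX unset gets the directory where the build installed the configs, so the traineddata files only need to be placed there.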
@amitdo Thanks for the clarification. I will remove that line. @zdenop I think when installing from a Linux distribution, eng and osd.traineddata are also installed along with the configs. The problem I see is when someone builds from source on GitHub. Then build and install will put config files as defined during installation. The post install instructions in wiki refer to TESSDATA_PREFIX and downloading of traineddata files to that directory. This will lead to the config files being in a different directory than traineddata. I would like to improve the instructions to avoid this issue. |
@zdenop I thought you had written a utility for downloading traineddata files. Please add a link to it from the wiki. |
@Shreeshrii : I understand your aim, but if someone is installing from source they should read the logs - My script for downloading traineddata files is here: https://github.com/zdenop/tessdata_downloader |
When building from GitHub source with autotools (make, install), the config files do not seem to get installed.
There have been reports in the past of missing 'lstm.train' etc. Recently I got a report from a user of a missing 'wordstrbox' file. Her tesseract version is
tesseract 5.0.0-alpha-582-g60b07
Where should I expect these files to be installed when building from master?