Building from source using autotools - config files not being found #2848

Closed
Shreeshrii opened this issue Jan 6, 2020 · 18 comments

@Shreeshrii
Collaborator

When building from GitHub source with autotools (make, make install), the config files do not seem to get installed.

There have been reports in the past of missing 'lstm.train' etc. Recently I got a report from a user of a missing 'wordstrbox' file. Her tesseract version is:

tesseract 5.0.0-alpha-582-g60b07

Where should I expect these files to be installed when building from master?

@amitdo
Collaborator

amitdo commented Jan 6, 2020

Linux: In /usr/local/share/tessdata
Unless you used ./configure --prefix=/usr

@Shreeshrii Shreeshrii changed the title from "Building from source not installing config files" to "Building from source using autotools - config files not being found" Jan 7, 2020
@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 7, 2020

@amitdo Thanks. You are right, the files are there in /usr/local/share/tessdata.

Then why do errors such as the following appear when running tesseract?

read_params_file: cant open wordstrbox

A Google search shows multiple issues related to this. Does this mean that read_params is not looking in the directory where the config files have been installed by the build process?

@zdenop
Contributor

zdenop commented Jan 7, 2020

Which command shows this error? Without full details there is no good reason to post an issue.

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 7, 2020

user report ...

tesseract 5.0.0-alpha-582-g60b07

tesseract $my_file ${my_file%.*} -l hin --psm 6 wordstrbox

This generates a txt file instead of a box file and also gives this message:

read_params_file: cant open wordstrbox

@amitdo
Collaborator

amitdo commented Jan 7, 2020

Did you set TESSDATA_PREFIX before the tesseract command?

@amitdo
Collaborator

amitdo commented Jan 7, 2020

Shree, here's something I found
Shreeshrii/tessdata_ocrb#1

So your answers were similar to mine :)

@Shreeshrii
Collaborator Author

Thanks, @amitdo . Yes, it is probably related to there being another older tessdata directory somewhere on the user's system.

When the autotools build and install is done and tesseract is run after that, it should find the tessdata and configs installed by the build. Does the build set a default value for TESSDATA_PREFIX?

What is the order of directories checked for tessdata files?

@amitdo
Collaborator

amitdo commented Jan 7, 2020

@Shreeshrii
Collaborator Author

@param argv0 - paths to the directory with language files and config files. An actual value of argv0 is used if not nullptr, otherwise TESSDATA_PREFIX is used if not nullptr, next try to use compiled in -DTESSDATA_PREFIX. If previous is not successful - use current directory.

Is -DTESSDATA_PREFIX = /usr/local/share/tessdata on Linux?

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 7, 2020

  1. --tessdata-dir from CLI
  2. TESSDATA_PREFIX from env
  3. compiled in -DTESSDATA_PREFIX (/usr/local/share/tessdata ?)
  4. current directory
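
The order above can be sketched as a small shell function. This is only an illustration of the described precedence, not tesseract's actual code; the function name and its three arguments are hypothetical, and the compiled-in default is passed in by hand because the script cannot read it from the binary.

```shell
# Sketch of the lookup order listed above; not tesseract's actual code.
# Arguments (hypothetical names):
#   $1 = value passed via --tessdata-dir (may be empty)
#   $2 = value of the TESSDATA_PREFIX environment variable (may be empty)
#   $3 = compiled-in default (may be empty)
resolve_tessdata_dir() {
  local cli_dir="$1" env_dir="$2" compiled_dir="$3"
  if [ -n "$cli_dir" ]; then
    printf '%s\n' "$cli_dir"
  elif [ -n "$env_dir" ]; then
    printf '%s\n' "$env_dir"
  elif [ -n "$compiled_dir" ]; then
    printf '%s\n' "$compiled_dir"
  else
    printf '%s\n' "."
  fi
}

# Example: no --tessdata-dir, TESSDATA_PREFIX taken from the environment,
# and /usr/local/share/tessdata assumed as the compiled-in default.
resolve_tessdata_dir "" "${TESSDATA_PREFIX:-}" "/usr/local/share/tessdata"
```

If this precedence holds, a stale TESSDATA_PREFIX that points at an older tessdata directory would win over the directory populated by make install, which would match the symptom reported here.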

@Shreeshrii
Collaborator Author

I think that wiki instructions need some change.

While the config files are installed by build process in /usr/local/share/tessdata, none of the traineddata files are installed. Tesseract won't work without osd.traineddata and eng.traineddata.

https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation#post-install-instructions says:

Once installation is complete, don't forget to do the following!:

Set a local variable called TESSDATA_PREFIX to point to the tesseract tessdata directory.

Ex: on Linux Ubuntu, modify your ~/.bashrc file by adding the following to the bottom of it. Modify the path according to your situation:

export TESSDATA_PREFIX="/home/$USER/Downloads/tesseract/tesseract-4.1.0/tessdata" 
Then, close and re-open your terminal for it to take effect, or just call . ~/.bashrc or export ~/.bashrc (same thing) for it to take effect immediately in your current terminal.

Place any language training data you need into this tessdata folder as well. For example, the English one is called eng.traineddata. Download it from the tessdata repository here, and move it to your tessdata directory you just specified in your TESSDATA_PREFIX variable above.

Here is the direct download link for eng.traineddata.
Now you are ready to use tesseract!

However, using a TESSDATA_PREFIX directory that contains only eng.traineddata but not the config files will lead to errors if users try to use any of the config files, e.g. tsv, hocr, pdf, wordstrbox, etc.
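
A quick way to spot this mismatch is to check one directory for both kinds of files. The sketch below is a hypothetical helper, not part of tesseract; it assumes that config files sit in a configs or tessconfigs subdirectory of the tessdata directory, which may differ on your installation.

```shell
# check_tessdata_dir DIR LANG CONFIG   (hypothetical helper)
# Reports whether DIR holds both LANG.traineddata and the named config file.
# Assumption: config files live in DIR/configs or DIR/tessconfigs.
check_tessdata_dir() {
  local dir="$1" lang="$2" config="$3" status=0
  if [ ! -f "$dir/$lang.traineddata" ]; then
    echo "missing: $dir/$lang.traineddata"
    status=1
  fi
  if [ ! -f "$dir/configs/$config" ] && [ ! -f "$dir/tessconfigs/$config" ]; then
    echo "missing: config '$config' under $dir/configs or $dir/tessconfigs"
    status=1
  fi
  return "$status"
}

# Example call; prints what is missing, if anything.
check_tessdata_dir "${TESSDATA_PREFIX:-/usr/local/share/tessdata}" eng wordstrbox || true
```

Running it against both the TESSDATA_PREFIX directory and /usr/local/share/tessdata shows which of the two is incomplete.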

@Shreeshrii
Collaborator Author

Shreeshrii commented Jan 12, 2020

Please see if the following instructions are correct and will solve this problem.

cd tesseract
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
make training
sudo make training-install

There are two parts to install for Tesseract: the engine itself and the training data for a language.
The above commands install the tesseract engine and training tools. They also install the config files needed for output such as pdf, tsv, hocr, alto, etc.

In addition to these, traineddata for a language is needed to recognize the text in images. Three types of traineddata files for over 130 languages and over 35 scripts are available in tesseract-ocr GitHub repos.

When building from source on Linux, the tessdata configs will be installed in /usr/local/share/tessdata.

  • Set a local variable called TESSDATA_PREFIX to point to this tesseract tessdata directory. For example, on Ubuntu Linux, add the following line to the bottom of your ~/.bashrc file, modifying the path according to your situation:

          export TESSDATA_PREFIX="/usr/local/share/tessdata"

  • Then, download eng.traineddata (and whatever other language(s) you need) to this tesseract tessdata directory.

If you want support for both the legacy (--oem 0) and LSTM (--oem 1) engine, download the traineddata from https://github.com/tesseract-ocr/tessdata.

Use traineddata from https://github.com/tesseract-ocr/tessdata_best or traineddata from https://github.com/tesseract-ocr/tessdata_fast if you only want support for LSTM engine (--oem 1).

Please make sure to use the download link or wget the raw file.
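
As a rough sketch of what "the raw file" means, the helper below builds a raw-file URL for a traineddata file. The function name is hypothetical, and the branch name (main by default here) is an assumption that should be checked against the repository before downloading.

```shell
# traineddata_url REPO LANG [BRANCH]   (hypothetical helper)
# Builds the raw-file URL for LANG.traineddata in a tesseract-ocr repo.
# Assumption: the default branch is "main"; pass another name if it differs.
traineddata_url() {
  local repo="$1" lang="$2" branch="${3:-main}"
  printf 'https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata\n' \
    "$repo" "$branch" "$lang"
}

# Print the URL first to check it.
traineddata_url tessdata_fast eng

# Uncomment to download into the tessdata directory used by the build:
# sudo wget -P /usr/local/share/tessdata "$(traineddata_url tessdata_fast eng)"
```

Fetching the regular repository page for the file instead of the raw URL would save an HTML page under the .traineddata name, which tesseract cannot load.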

@zdenop
Contributor

zdenop commented Jan 12, 2020

My 2 cents:
If you are on Linux and you have no clue about tesseract details, use the package management of your distribution.

For more advanced users, PR #2459 should be the solution.

@amitdo
Collaborator

amitdo commented Jan 12, 2020

Set a local variable called TESSDATA_PREFIX to point to this tesseract tessdata directory,

It's only needed if you want to put the traineddata in a different directory than the directory that was defined during installation. As I said, you can change the default path with the --prefix parameter.
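
To illustrate how the prefix maps to the default tessdata directory, here is a minimal sketch. It assumes the usual autotools layout in which data files go under PREFIX/share, and the function name is hypothetical.

```shell
# tessdata_dir_for_prefix [PREFIX]   (hypothetical helper)
# Assumption: the build installs tessdata under PREFIX/share/tessdata,
# with /usr/local as the prefix when ./configure is run without --prefix.
tessdata_dir_for_prefix() {
  local prefix="${1:-/usr/local}"
  printf '%s/share/tessdata\n' "${prefix%/}"
}

tessdata_dir_for_prefix        # default prefix
tessdata_dir_for_prefix /usr   # as with ./configure --prefix=/usr
```

Placing the traineddata files in the directory this prints would keep them next to the installed config files, so TESSDATA_PREFIX would not need to be set.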

@Shreeshrii
Collaborator Author

@amitdo Thanks for the clarification. I will remove that line.

@zdenop I think when installing from a Linux distribution, eng and osd.traineddata are also installed along with the configs.

The problem I see is when someone builds from source on GitHub. Build and install will then put the config files in the directory defined during installation. The post-install instructions in the wiki refer to TESSDATA_PREFIX and to downloading traineddata files to that directory. This leads to the config files being in a different directory than the traineddata.

I would like to improve the instructions to avoid this issue.

@Shreeshrii
Collaborator Author

@zdenop I thought you had written a utility for downloading traineddata files. Please add a link to it from the wiki.

@zdenop
Contributor

zdenop commented Jan 13, 2020

@Shreeshrii: I understand your aim, but if someone is installing from source they should read the logs: make install will report what is installed and where. If they do not read that, then why should they read something else ;-)? I am more in favor of supporting packagers, who solve problems for this kind of user...

My script for downloading traineddata files is here: https://github.com/zdenop/tessdata_downloader
