Building from source using autotools - config files not being found #2848

Shreeshrii · 2020-01-06T15:30:24Z

When building from GitHub source with autotools (make, install), the config files do not seem to get installed.

There have been reports in the past of missing 'lstm.train' etc. Recently I got a report from a user about a missing 'wordstrbox' file. Her tesseract version is:

tesseract 5.0.0-alpha-582-g60b07

Where should I expect these files to be installed when building from master?

amitdo · 2020-01-06T16:07:33Z

Linux: In /usr/local/share/tessdata
Unless you used ./configure --prefix=/usr

Shreeshrii · 2020-01-07T06:53:28Z

@amitdo Thanks. You are right, the files are there in /usr/local/share/tessdata.

Then why do errors such as the following appear when running tesseract?

read_params_file: cant open wordstrbox

A Google search shows multiple issues related to this. Does this mean that read_params is not looking in the directory where the config files have been installed by the build process?

zdenop · 2020-01-07T06:57:03Z

Which command shows this error? Without full details there is no good reason to post an issue.

Shreeshrii · 2020-01-07T06:59:04Z

user report ...

tesseract 5.0.0-alpha-582-g60b07

tesseract $my_file ${my_file%.*} -l hin --psm 6 wordstrbox

generates a txt file instead of a box file and also gives this message:

read_params_file: cant open wordstrbox

amitdo · 2020-01-07T07:29:28Z

Did you set TESSDATA_PREFIX before the tesseract command?

amitdo · 2020-01-07T07:43:56Z

Shree, here's something I found
Shreeshrii/tessdata_ocrb#1

So your answers were similar to mine :)

Shreeshrii · 2020-01-07T08:10:30Z

Thanks, @amitdo. Yes, it is probably related to there being another, older tessdata directory somewhere on the user's system.

When the autotools build and install is done and tesseract is run after that, it should find the tessdata and configs installed by the build. Does the build set a default value for TESSDATA_PREFIX?

What is the order of directories checked for tessdata files?

amitdo · 2020-01-07T08:24:42Z

The "magic" is done here:
https://github.com/tesseract-ocr/tesseract/blob/cb0c024a6f9/src/ccutil/mainblk.cpp

Shreeshrii · 2020-01-07T10:22:52Z

@param argv0 - paths to the directory with language files and config files. An actual value of argv0 is used if not nullptr, otherwise TESSDATA_PREFIX is used if not nullptr, next try to use compiled in -DTESSDATA_PREFIX. If previous is not successful - use current directory.

Is -DTESSDATA_PREFIX = /usr/local/share/tessdata on Linux?

Shreeshrii · 2020-01-07T10:25:25Z

--tessdata-dir from CLI
TESSDATA_PREFIX from env
compiled in -DTESSDATA_PREFIX (/usr/local/share/tessdata ?)
current directory
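
The order above can be sketched as a small shell function. This only illustrates the list in this comment and is not the logic in mainblk.cpp itself. The function name `resolve_tessdata_dir` and the `COMPILED_PREFIX` variable are hypothetical, and the compiled-in default is assumed to be /usr/local/share/tessdata.

```shell
#!/bin/sh
# Hypothetical stand-in for the compiled-in -DTESSDATA_PREFIX value (assumed).
# An explicitly empty value is kept empty, to model "nothing compiled in".
COMPILED_PREFIX="${COMPILED_PREFIX-/usr/local/share/tessdata}"

# Print the directory that would be picked, following the order listed above.
# $1 is the value given to --tessdata-dir (may be empty).
resolve_tessdata_dir() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"                 # 1. --tessdata-dir from CLI
    elif [ -n "${TESSDATA_PREFIX:-}" ]; then
        printf '%s\n' "$TESSDATA_PREFIX"   # 2. TESSDATA_PREFIX from env
    elif [ -n "$COMPILED_PREFIX" ]; then
        printf '%s\n' "$COMPILED_PREFIX"   # 3. compiled-in default
    else
        printf '%s\n' "."                  # 4. current directory
    fi
}
```

If this reading is right, an old TESSDATA_PREFIX left in the environment would win over the directory used by the build, which could explain the user report.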

Shreeshrii · 2020-01-07T13:36:50Z

I think that wiki instructions need some change.

While the config files are installed by build process in /usr/local/share/tessdata, none of the traineddata files are installed. Tesseract won't work without osd.traineddata and eng.traineddata.

https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation#post-install-instructions says:

Once installation is complete, don't forget to do the following!:

Set a local variable called TESSDATA_PREFIX to point to the tesseract tessdata directory.

Ex: on Linux Ubuntu, modify your ~/.bashrc file by adding the following to the bottom of it. Modify the path according to your situation:

export TESSDATA_PREFIX="/home/$USER/Downloads/tesseract/tesseract-4.1.0/tessdata" 
Then, close and re-open your terminal for it to take effect, or just call . ~/.bashrc or export ~/.bashrc (same thing) for it to take effect immediately in your current terminal.

Place any language training data you need into this tessdata folder as well. For example, the English one is called eng.traineddata. Download it from the tessdata repository here, and move it to your tessdata directory you just specified in your TESSDATA_PREFIX variable above.

Here is the direct download link for eng.traineddata.
Now you are ready to use tesseract!

However, using a TESSDATA_PREFIX directory which has only eng.traineddata but not the config files will lead to errors if users try using any of the config files, e.g. tsv, hocr, pdf, wordstrbox, etc.
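
A quick way to spot this situation might be a check like the one below. It is a rough sketch: the function name `check_tessdata_dir` is made up for illustration, and it assumes that config files such as wordstrbox sit in a `configs` subdirectory of the tessdata directory.

```shell
#!/bin/sh
# Report whether a tessdata directory has both language data and config files.
# Assumption: config files (tsv, hocr, pdf, wordstrbox, ...) live in "$dir/configs".
check_tessdata_dir() {
    dir="$1"
    rc=0
    # At least one *.traineddata file is needed for recognition.
    if ! ls "$dir"/*.traineddata >/dev/null 2>&1; then
        echo "no traineddata files in $dir"
        rc=1
    fi
    # Without the configs directory, config names on the command line cannot be opened.
    if [ ! -d "$dir/configs" ]; then
        echo "no configs directory in $dir"
        rc=1
    fi
    return "$rc"
}
```

For example, `check_tessdata_dir "${TESSDATA_PREFIX:-/usr/local/share/tessdata}"` would check whichever directory is likely to be in use.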

Shreeshrii · 2020-01-12T13:43:17Z

Please see if the following instructions are correct and will solve this problem.

cd tesseract
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
make training
sudo make training-install

There are two parts to install for Tesseract: the engine itself and the traineddata for a language.
The above commands install the tesseract engine and training tools. They also install the config files needed for output such as pdf, tsv, hocr, alto, etc.

In addition to these, traineddata for a language is needed to recognize the text in images. Three types of traineddata files for over 130 languages and over 35 scripts are available in tesseract-ocr GitHub repos.

When building from source on Linux, the tessdata configs will be installed in /usr/local/share/tessdata.

Set a local variable called TESSDATA_PREFIX to point to this tesseract tessdata directory, e.g. on Linux Ubuntu, modify your ~/.bashrc file by adding the following to the bottom of it. Modify the path according to your situation:
```
export TESSDATA_PREFIX="/usr/local/share/tessdata"
```
Then, download eng.traineddata (and whatever other language(s) you need) to this tesseract tessdata directory.

If you want support for both the legacy (--oem 0) and LSTM (--oem 1) engines, download the traineddata from https://github.com/tesseract-ocr/tessdata.

Use traineddata from https://github.com/tesseract-ocr/tessdata_best or traineddata from https://github.com/tesseract-ocr/tessdata_fast if you only want support for the LSTM engine (--oem 1).

Please make sure to use the download link or wget the raw file.
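
As a rough sketch of what "wget the raw file" could look like, the helper below builds a raw-file URL and prints the matching wget command without running it. The function names are made up for illustration, and the branch name `main` in the URL is an assumption that may need adjusting for the repository in question.

```shell
#!/bin/sh
# Build the raw-file URL for a traineddata file.
# $1: repository (tessdata, tessdata_best or tessdata_fast); $2: language code.
# Assumption: the branch is called "main".
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' "$1" "$2"
}

# Print the wget command instead of running it, so nothing is downloaded by accident.
# The target directory is the default install location mentioned in this thread.
print_download_command() {
    printf 'wget -P %s %s\n' "/usr/local/share/tessdata" "$(traineddata_url "$1" "$2")"
}
```

Calling `print_download_command tessdata_fast eng` prints a command that can be reviewed and then run, typically with sudo, since the target directory is usually not writable by a normal user.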

zdenop · 2020-01-12T13:57:44Z

My 2 cents:
If you are on Linux and you have no clue about tesseract details, use the package management of your distribution.

For more advanced users, PR #2459 should be the solution.

amitdo · 2020-01-12T14:04:03Z

Set a local variable called TESSDATA_PREFIX to point to this tesseract tessdata directory,

It's only needed if you want to put the traineddata in a different directory than the directory that was defined during installation. As I said, you can change the default path with the --prefix parameter.
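
To make the relation between --prefix and the default directory concrete, here is a tiny sketch. The function name `default_tessdata_dir` is hypothetical, and it assumes the usual layout of `<prefix>/share/tessdata` with no other directory options passed to ./configure.

```shell
#!/bin/sh
# Print the tessdata directory for a given ./configure --prefix value.
# Assumption: the layout is <prefix>/share/tessdata (datadir not overridden).
default_tessdata_dir() {
    prefix="${1:-/usr/local}"                     # autotools default prefix
    printf '%s/share/tessdata\n' "${prefix%/}"    # drop one trailing slash, if any
}
```

With no argument this gives the /usr/local/share/tessdata location mentioned earlier, and with `/usr` it gives the location used after ./configure --prefix=/usr.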

Shreeshrii · 2020-01-12T16:01:36Z

@amitdo Thanks for the clarification. I will remove that line.

@zdenop I think when installing from a Linux distribution, eng and osd.traineddata are also installed along with the configs.

The problem I see is when someone builds from source on GitHub. Then build and install will put the config files in the directory defined during installation. The post-install instructions in the wiki refer to TESSDATA_PREFIX and to downloading traineddata files to that directory. This will lead to the config files being in a different directory than the traineddata.

I would like to improve the instructions to avoid this issue.

Shreeshrii · 2020-01-12T16:08:35Z

@zdenop I thought you had written a utility for downloading traineddata files. Please add a link to it from the wiki.

zdenop · 2020-01-13T06:56:38Z

@Shreeshrii: I understand your aim, but if someone is installing from source they should read the logs - make install will report what is installed and where. If they do not read that, then why should they read something else ;-)? I am more in favor of supporting packagers who solve problems for this kind of user...

My script for downloading traineddata files is here: https://github.com/zdenop/tessdata_downloader

Shreeshrii · 2020-01-14T13:08:52Z

Made changes in wiki - see https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation/_compare/24c668f50c5879f13c02aa05aad48d716914b7b2...014c6ef722ae2c495c17bd6f6c5c510fd936c4bb

Shreeshrii changed the title ~~Building from source not installing config files~~ Building from source using autotools - config files not being found Jan 7, 2020

Shreeshrii closed this as completed Jan 14, 2020

amitdo added the question label May 20, 2020
