4F2E4A2E edited this page Feb 11, 2015 · 6 revisions

This page is under construction.
Please be patient while we gather the information together and continue testing.

Tess4J is being developed and tested with Java 32-bit on Windows and Linux.

Instructions

The Tesseract OCR DLL file, language data for English, and sample images are bundled with the program. Language data packs for Tesseract should be decompressed and placed into the tessdata folder.
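As a quick sanity check after unpacking language packs, the contents of the tessdata folder can be listed. The sketch below uses only the Java standard library and is not part of Tess4J; it assumes the Tesseract 3.x convention that each language ships as a file named `<lang>.traineddata`, and the class name `TessdataLanguages` and the relative path `tessdata` are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Tess4J class): lists language codes found in a tessdata folder.
class TessdataLanguages {
    // Assumed naming convention for Tesseract 3.x language data files.
    static final String SUFFIX = ".traineddata";

    /** Returns the sorted language codes found in the folder, or an empty list if it is missing. */
    static List<String> installed(File tessdataDir) {
        List<String> langs = new ArrayList<>();
        File[] files = tessdataDir.listFiles();
        if (files == null) {
            return langs; // not a directory, or not readable
        }
        for (File f : files) {
            String name = f.getName();
            if (f.isFile() && name.endsWith(SUFFIX) && name.length() > SUFFIX.length()) {
                langs.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(langs);
        return langs;
    }

    public static void main(String[] args) {
        // "tessdata" is an assumed location relative to the working directory.
        File dir = new File(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Languages in " + dir.getPath() + ": " + installed(dir));
    }
}
```

An empty list usually means the language pack was unpacked into the wrong folder or is still compressed.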

Update: For a 64-bit JVM on Windows, try the 64-bit Tesseract and Leptonica DLLs from the Tesseract .NET wrapper project. These DLLs were built with VS2012/VS2013 and therefore depend on the Visual C++ Redistributable for VS2012 or the Visual C++ Redistributable for VS2013.
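Because the native libraries must match the bitness of the JVM, it can help to confirm which kind of JVM is actually running. The following is a minimal sketch, not part of Tess4J. It assumes the `sun.arch.data.model` system property is available (it is on HotSpot-based JVMs but is not guaranteed elsewhere) and falls back to a rough guess from `os.arch`; the class name `JvmBitness` is hypothetical.

```java
// Hypothetical helper (not a Tess4J class): reports whether the JVM is 32-bit or 64-bit.
class JvmBitness {
    /**
     * Returns 32 or 64, or -1 if neither property gives a hint.
     * dataModel is the value of "sun.arch.data.model"; osArch is the value of "os.arch".
     */
    static int detect(String dataModel, String osArch) {
        if ("32".equals(dataModel) || "64".equals(dataModel)) {
            return Integer.parseInt(dataModel);
        }
        if (osArch == null) {
            return -1;
        }
        // Heuristic fallback: names such as "amd64" or "x86_64" contain "64".
        return osArch.contains("64") ? 64 : 32;
    }

    public static void main(String[] args) {
        int bits = detect(System.getProperty("sun.arch.data.model"),
                          System.getProperty("os.arch"));
        System.out.println("JVM bitness: " + bits);
    }
}
```

If this reports 64 on Windows, the 64-bit DLLs mentioned above are the ones to try.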

The Linux shared object library (libtesseract.so), the equivalent of the DLL, is available in Tesseract 3.02 and can be built from source by following the instructions in the Tesseract Wiki.

Tess4J can be built and unit tested using Apache Ant and JUnit. Unzip the source and execute at the command line:

ant test

Notes: On platforms whose default charset is not UTF-8, the output text may have character encoding issues. With version 1.0, you may need to set the default character encoding for the program that calls Tess4J, either by passing the JVM the command-line option -Dfile.encoding=UTF8 or by setting the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8. This is no longer needed as of version 1.1.
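To see whether the encoding note applies to a given machine, the JVM's default charset can be inspected at startup. This is a minimal standard-library sketch, not Tess4J code; the class name `EncodingCheck` is hypothetical. It also shows why the default matters: the same bytes decode correctly only when UTF-8 is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Tess4J class): checks the JVM default charset.
class EncodingCheck {
    /** True if the JVM default charset is UTF-8. */
    static boolean defaultIsUtf8() {
        return Charset.defaultCharset().equals(StandardCharsets.UTF_8);
    }

    /** Decodes bytes explicitly as UTF-8, independent of the JVM default. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        if (!defaultIsUtf8()) {
            System.out.println("Consider starting the JVM with -Dfile.encoding=UTF8");
        }
        // The two bytes 0xC3 0xA9 are the UTF-8 encoding of U+00E9.
        byte[] eAcute = {(byte) 0xC3, (byte) 0xA9};
        System.out.println("Decoded length as UTF-8: " + decodeUtf8(eAcute).length());
    }
}
```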

Support for PDF documents is available through GPL Ghostscript, which should be installed and included in the system path.
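One rough way to confirm that Ghostscript is reachable is to scan the PATH entries for its command-line executable. The sketch below is an assumption-laden proxy, not Tess4J code: this page does not say which Ghostscript file Tess4J loads, and the names `gs`, `gswin32c.exe`, and `gswin64c.exe` are simply the executable names commonly shipped with Ghostscript installs. The class name `GhostscriptCheck` is hypothetical.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a Tess4J class): looks for a Ghostscript executable on the PATH.
class GhostscriptCheck {
    // Assumed executable names for common Linux and Windows installs.
    static final List<String> CANDIDATES =
            Arrays.asList("gs", "gswin32c.exe", "gswin64c.exe");

    /** Returns the first matching executable found in the given PATH value, or null. */
    static File findOnPath(String pathValue, List<String> names) {
        if (pathValue == null) {
            return null;
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile() && candidate.canExecute()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        File gs = findOnPath(System.getenv("PATH"), CANDIDATES);
        System.out.println(gs == null
                ? "No Ghostscript executable found on PATH"
                : "Found Ghostscript at " + gs.getAbsolutePath());
    }
}
```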

Images intended for OCR should have a resolution of at least 200 DPI, typically 300 DPI, and be in 1 bpp (bit per pixel) monochrome or 8 bpp grayscale uncompressed TIFF or PNG format. PNG files are usually smaller than other image formats while keeping high quality, because PNG uses lossless compression; TIFF has the advantage that a single file can contain multiple images (pages).
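The image guidelines above can be sketched with the standard Java imaging classes. This is an illustrative example only, not Tess4J code: it converts an image to 8 bpp grayscale and computes how many pixels a scan needs for a target DPI. The class name `OcrImagePrep` and both method names are hypothetical.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper (not a Tess4J class): prepares images along the guidelines above.
class OcrImagePrep {
    /** Returns an 8 bpp grayscale copy of the source image. */
    static BufferedImage toGrayscale(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a TYPE_BYTE_GRAY image performs the color conversion.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Number of pixels needed along one side of the given physical length at the given DPI. */
    static long pixelsFor(double inches, int dpi) {
        return Math.round(inches * dpi);
    }

    public static void main(String[] args) {
        BufferedImage color = new BufferedImage(100, 50, BufferedImage.TYPE_INT_RGB);
        BufferedImage gray = toGrayscale(color);
        System.out.println("Bits per pixel: " + gray.getColorModel().getPixelSize());
        System.out.println("Width in pixels of an 8.5 inch page at 300 DPI: "
                + pixelsFor(8.5, 300));
    }
}
```

The result can then be written out as PNG or TIFF with whatever image I/O library the calling program already uses.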

