Usage
Tess4J is developed and tested with 32-bit Java on Windows and Linux.
The Tesseract OCR DLL file, language data for English, and sample images are bundled with the program. Language data packs for Tesseract should be decompressed and placed into the tessdata folder.
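To confirm that language data packs were unpacked into the right place, a small check like the following may help. It is only a sketch using the Java standard library, not part of Tess4J. It assumes the Tesseract 3.x convention that each language is a file named `<code>.traineddata` (for example `eng.traineddata`), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
class TessdataCheck {
    // Assumption: language files follow the "<code>.traineddata" naming of Tesseract 3.x.
    private static final String SUFFIX = ".traineddata";

    /** Returns the sorted language codes found in the given tessdata folder. */
    static List<String> installedLanguages(Path tessdata) throws IOException {
        List<String> langs = new ArrayList<>();
        if (!Files.isDirectory(tessdata)) {
            return langs; // folder is missing, so no language data is installed
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tessdata, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                langs.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(langs);
        return langs;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to a "tessdata" folder in the working directory.
        Path tessdata = Paths.get(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Languages in " + tessdata.toAbsolutePath() + ": "
                + installedLanguages(tessdata));
    }
}
```

Run it from the directory that contains the tessdata folder, or pass the folder path as the first argument.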
Update: For a 64-bit JVM on Windows, try the 64-bit Tesseract and Leptonica DLLs from the Tesseract .NET wrapper project. These DLLs were built with VS2012/VS2013 and therefore depend on the Visual C++ Redistributable for VS2012 or the Visual C++ Redistributable for VS2013.
The Linux shared object library (libtesseract.so), the equivalent of the DLL, is available in Tesseract 3.02, which can be built from source by following the instructions in the Tesseract Wiki.
Tess4J can be built and unit tested using Apache Ant and JUnit. Unzip the source and execute at the command line:
ant test
Notes: On platforms whose default charset is not UTF-8, the output text may have character encoding issues. With version 1.0, you may need to set the default character encoding for the program that calls Tess4J, either by passing the JVM the command-line option -Dfile.encoding=UTF8 or by setting the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8. This is no longer needed for version 1.1.
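Whether the encoding note applies to a given machine can be checked from Java itself. The sketch below uses only the standard library and is not Tess4J code. The class and method names are hypothetical. It reports the JVM default charset and shows how to decode bytes explicitly as UTF-8 so that the result does not depend on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tess4J.
class EncodingCheck {
    /** True when the JVM default charset is UTF-8. */
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    /** Decodes raw bytes as UTF-8, independent of the platform default charset. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        if (!defaultIsUtf8()) {
            // Same option as described in the note above.
            System.out.println("Consider starting the JVM with -Dfile.encoding=UTF8");
        }
    }
}
```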
Support for PDF documents is available through GPL Ghostscript, which should be installed and included in the system path.
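One way to verify that Ghostscript is reachable is to scan the PATH entries for its executable. This is a minimal sketch, not something Tess4J provides. The class and method names are hypothetical, and the executable names (`gs` on Linux, `gswin32c.exe` or `gswin64c.exe` on Windows) are assumptions based on common Ghostscript installs, so they may need adjusting for a particular setup.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tess4J.
class GhostscriptCheck {
    // Assumption: typical Ghostscript console executable names.
    static final List<String> CANDIDATES = Arrays.asList("gs", "gswin32c.exe", "gswin64c.exe");

    /** Returns the first file with one of the given names found in the PATH-style string. */
    static Optional<Path> findOnPath(String pathValue, List<String> names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed PATH entries instead of failing.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> gs = findOnPath(System.getenv("PATH"), CANDIDATES);
        System.out.println(gs.isPresent()
                ? "Ghostscript found: " + gs.get()
                : "Ghostscript not found on the system path");
    }
}
```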
Images intended for OCR should have a resolution of at least 200 DPI, typically 300 DPI, and be in 1 bpp (bit per pixel) monochrome or 8 bpp grayscale, in uncompressed TIFF or PNG format. PNG files are usually smaller than other image formats while keeping high quality, because PNG uses lossless data compression; TIFF has the advantage that a single file can contain multiple images (pages).
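The bit-depth recommendation can be applied with the standard Java imaging classes before an image is handed to OCR. The following is a sketch under stated assumptions, not Tess4J code: the class and method names are hypothetical, it converts an image to 8 bpp grayscale and writes it as PNG, and it does not check or change the DPI.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Tess4J.
class OcrImagePrep {
    /** True for 1 bpp images or 8 bpp images with a single color component (grayscale). */
    static boolean isOcrFriendlyDepth(BufferedImage img) {
        ColorModel cm = img.getColorModel();
        int bpp = cm.getPixelSize();
        return bpp == 1 || (bpp == 8 && cm.getNumColorComponents() == 1);
    }

    /** Returns an 8 bpp grayscale copy of the source image. */
    static BufferedImage toGrayscale(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(src, 0, 0, null); // color conversion happens while drawing
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: OcrImagePrep <input image> <output.png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            System.err.println("Unsupported or unreadable image: " + args[0]);
            return;
        }
        BufferedImage out = isOcrFriendlyDepth(src) ? src : toGrayscale(src);
        ImageIO.write(out, "png", new File(args[1])); // PNG is lossless
    }
}
```

Images that are already 1 bpp or 8 bpp grayscale are written unchanged; anything else, such as 24 bpp RGB, is converted first.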
ℹ️ Is information missing from this page? Please create an issue and tell us about it.