<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>jTessBoxEditorFX - Box Editor &amp; Trainer for Tesseract OCR Data</title>
<style type="text/css">
.auto-style1
{
text-decoration: underline;
}
</style>
</head>
<body lang="EN-US">
<div>
<h2 style="text-align: center;">
jTessBoxEditorFX</h2>
<h3>
DESCRIPTION</h3>
<p>
<a href="http://vietocr.sourceforge.net/training.html">jTessBoxEditorFX</a> is a box
editor and trainer for <a href="https://github.com/tesseract-ocr">Tesseract OCR</a>.
It edits box data in both the Tesseract 2.0x and 3.0x formats and fully automates
Tesseract training. It can read common image formats, including multi-page TIFF.
LSTM training for Tesseract 4.0x is not supported. The JavaFX-based jTessBoxEditorFX
was developed to address problems with rendering complex scripts in the Swing-based
jTessBoxEditor program.
</p>
<p>
jTessBoxEditorFX is released and distributed under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License, v2.0</a>.
</p>
<h3>
SYSTEM REQUIREMENTS</h3>
<p>
<a href="https://www.oracle.com/java/technologies/downloads/">Java 21</a> and <a href="https://gluonhq.com/products/javafx/">JavaFX 21</a>.
</p>
<h3>
INSTRUCTIONS</h3>
<p>
Execute the following commands to launch the program:
</p>
Windows:
<blockquote>
<code>set PATH_TO_FX="C:\Program Files\Java\javafx-sdk-21.0.1\lib"<br />
java -Xms128m -Xmx1024m --module-path %PATH_TO_FX% --add-modules javafx.controls,javafx.fxml,javafx.web -jar jTessBoxEditorFX.jar</code>
</blockquote>
Linux/Mac:
<blockquote>
<code>export PATH_TO_FX=path/to/javafx-sdk-21.0.1/lib<br />
java -Xms128m -Xmx1024m --module-path $PATH_TO_FX --add-modules javafx.controls,javafx.fxml,javafx.web -jar jTessBoxEditorFX.jar</code>
</blockquote>
<p>
You will need to provide TIFF/box files as input to the editor. Images used for
training should be 300 DPI, uncompressed TIFF files, either 1 bpp (bit per pixel)
black-and-white or 8 bpp grayscale. Box files, encoded in UTF-8, are generated by
the Tesseract executables with appropriate command-line options (see
<a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">Tesseract Training Wiki</a>).
Alternatively, both can be created using the built-in <em>TIFF/Box Generator</em>.</p>
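<p>
As an illustration of the box data mentioned above, the following Python sketch parses
one line of a box file. It assumes the usual Tesseract 3.0x layout (glyph, left, bottom,
right, top, page, separated by spaces) and treats a line with only four numbers as
2.0x-style data on page 0. <code>Box</code> and <code>parse_box_line</code> are
hypothetical names and are not part of jTessBoxEditorFX.</p>
<pre>
```python
from typing import NamedTuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line into a Box (assumed layout, see lead-in)."""
    parts = line.split()
    # Expect the glyph followed by four coordinates and an optional page index.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected box line: {line!r}")
    glyph = parts[0]
    numbers = [int(p) for p in parts[1:]]
    if len(numbers) == 4:
        # Assumption: no page field means a single-page (2.0x-style) file.
        numbers.append(0)
    return Box(glyph, *numbers)
```
</pre>
<p>
For example, <code>parse_box_line("A 10 20 30 40 0")</code> yields a <code>Box</code>
whose <code>left</code> is 10 and whose <code>top</code> is 40. Files should be opened
with <code>encoding="utf-8"</code> so that non-ASCII glyphs survive.</p>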
<p>
The following hotkeys are available in Box View for ease of editing:</p>
<ul>
<li><strong>W/S</strong> - move box up/down;<strong> A/D</strong> - move box left/right</li>
<li><strong>Q/E</strong> - decrease/increase box width;<strong> R/F</strong> - decrease/increase box height</li>
<li><strong>&lt;/&gt;</strong> - previous/next box</li>
<li><strong>X</strong> - edit character in box</li>
</ul>
<p>
Holding Shift while using a hotkey multiplies the movement speed by 10.
Pressing Enter or Esc while editing a character moves focus to the box editor.</p>
<p>
You can reorder boxes through table row drag-and-drop operations.</p>
<p>
Note that the coordinate system used in the box file has (0,0) at the bottom-left,
whereas computer graphics devices define (0,0) as the top-left. jTessBoxEditorFX
works in and displays graphics device coordinates; edited box files are still
read and written in the proper box file format.
</p>
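<p>
The conversion between the two coordinate systems amounts to flipping the y axis
against the image height. The Python sketch below shows one way to do this. The
function names and the <code>(x, y, width, height)</code> screen rectangle are
assumptions made for illustration, not the program's actual code.</p>
<pre>
```python
def box_to_screen(left, bottom, right, top, image_height):
    """Box file coordinates (origin bottom-left) to a screen rectangle.

    Returns (x, y, width, height) with the origin at the top-left.
    """
    return (left, image_height - top, right - left, top - bottom)


def screen_to_box(x, y, width, height, image_height):
    """Inverse conversion: returns (left, bottom, right, top)."""
    return (x, image_height - y - height, x + width, image_height - y)
```
</pre>
<p>
With an image 100 pixels high, the box <code>(10, 20, 30, 50)</code> maps to the
screen rectangle <code>(10, 50, 20, 30)</code>, and converting that rectangle back
returns the original box.</p>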
<p>
For a given input UTF-8 text file, the <em>TIFF/Box Generator</em> produces a TIFF/box
pair of files suitable for training with Tesseract. Depending on whether anti-aliasing
is enabled, the generated image is a binary or grayscale, uncompressed multi-page TIFF
at 300 DPI. Letter tracking, or spacing between characters, can be adjusted to eliminate
bounding box overlap. Note that the coordinates of some boxes may differ slightly
(by 1 or 2 pixels) from those that Tesseract itself would generate; nevertheless,
the generated box file can be used to validate the one created by Tesseract
using a Unicode-compatible file comparison tool, such as <a href="http://sourceforge.net/projects/winmerge/">
WinMerge</a>.
</p>
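<p>
Because differences of 1 or 2 pixels are expected, a plain text diff will flag many
harmless lines. As a rough alternative, the Python sketch below compares two lists of
boxes with a pixel tolerance. It assumes each entry is a tuple of
<code>(glyph, left, bottom, right, top)</code> and that both lists hold the boxes in
the same order; <code>differing_boxes</code> is a hypothetical helper, not a feature
of the program.</p>
<pre>
```python
def differing_boxes(generated, reference, tolerance=2):
    """Return (index, generated_box, reference_box) for entries that disagree.

    Two entries agree when the glyphs are equal and no coordinate differs
    by more than `tolerance` pixels. Entries are compared position by
    position; zip() stops at the end of the shorter list.
    """
    mismatches = []
    for index, (gen, ref) in enumerate(zip(generated, reference)):
        same_glyph = gen[0] == ref[0]
        max_shift = max(abs(g - r) for g, r in zip(gen[1:5], ref[1:5]))
        if not same_glyph or max_shift > tolerance:
            mismatches.append((index, gen, ref))
    return mismatches
```
</pre>
<p>
An empty result means the two files agree within the tolerance. A missing or extra box
shifts every later comparison, so a long run of mismatches usually points to a
misalignment rather than to many bad boxes.</p>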
<p>
<span class="auto-style1">Tips</span>: Experiments indicate that the quality of
training with images created by <em>TIFF/Box Generator</em> is higher with font
sizes 12pt or greater and with some noise added.
</p>
<p>
Automated training is provided in the latest version. Tesseract training executables
for Windows are bundled with the program; for other platforms, you will need to <a href="https://github.com/tesseract-ocr/tesseract/wiki/Compiling">
build</a> them. Place all required source training data files, prefixed with
the appropriate language code, in a specified directory (see the <code>samples</code>
folder for examples). Training can also be automated using the enclosed
<code>train.ps1</code> Windows PowerShell script.
</p>
<p>
The <em>Merge TIFF</em> function can save multiple images containing text of the
same font into a single multi-page TIFF file to be used for training.
A conversion function is included to convert numeric character references (NCRs) and
escape sequences in the <em>Character</em> text field to Unicode characters.</p>
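<p>
To make the conversion concrete, the Python sketch below decodes decimal and
hexadecimal NCRs (such as <code>&amp;#3585;</code> or <code>&amp;#x0E01;</code>) and
escape sequences. It assumes the escape sequences take the <code>\uXXXX</code> form,
which this document does not specify, and <code>to_unicode</code> is a hypothetical
name. It illustrates the idea only and is not the program's implementation.</p>
<pre>
```python
import re

# chr(38) is the ampersand; it is spelled this way so the snippet can sit
# inside an HTML page without escaping.
AMP = chr(38)
NCR = re.compile(AMP + r"#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")
# Assumption: escape sequences are a backslash, "u", and four hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def to_unicode(text: str) -> str:
    """Replace NCRs and \\uXXXX escapes in `text` with the characters they name."""

    def decode_ncr(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))

    text = NCR.sub(decode_ncr, text)
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```
</pre>
<p>
All three spellings of the Thai letter KO KAI (U+0E01), namely the decimal NCR, the
hexadecimal NCR, and the <code>\u0E01</code> escape, decode to the same character.
Text that matches neither pattern is returned unchanged.</p>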
<p>
If you have any questions, please post them in the <a href="http://sourceforge.net/projects/vietocr/forums">
VietOCR Forums</a>.
</p>
<hr />
</div>
</body>
</html>