node-ocr

Code needed to replicate the current OCR workflow used on bahai.works.

Most of these files are specific to one periodical, e.g. converter-bn.js for Baha'i News or converter-cbn.js for Canadian Baha'i News.

For books, use converter-miscbahai.js.

Prerequisite installation instructions

1. Download or clone this repository to a folder on your computer, e.g. C:/node-ocr.
2. Download and install Node.js (https://nodejs.org/en/). On Windows, you do not need to install Chocolatey when the installer offers it.
3. (Windows) Download and install Git (https://git-scm.com/download/win). I kept the default options during installation.
4. In a console (I use ConEmu, https://conemu.github.io/), run:
   1. cd C:\node-ocr
   2. npm i -g pnpm (skip this if you already have pnpm)
   3. pnpm i

   C:/node-ocr should now have a new folder: node_modules.

5. Download and install Tesseract (Windows: https://github.com/UB-Mannheim/tesseract/wiki). If you plan to OCR other languages, you can select them during installation; they can also be added later. You may also need to add Tesseract to your PATH variable, as described below.

You are now ready to use this package.

OCR instructions

1. Open the PDF you want to OCR and save each page as a PNG image (requires Acrobat Pro or similar). In my version of Acrobat: File > Save As Other > Image > PNG.
2. Modify the paths in the script (e.g. converter-miscbahai.js), namely outputPath: __dirname and getImagePath, to match the location where you saved the page images in step 1. Where you see /../../Bahai.works/, that means going up two directory levels from where node-ocr is installed. If you put the images in a folder called "Images" on your desktop, it would be /../Users/[username]/Desktop/, and in the return section it would look something like return 'C:\Users\[username]\Desktop\...'. Note that inside a JavaScript string a single backslash starts an escape sequence, so in the script write each backslash as \\ or use forward slashes.
3. Navigate to the node-ocr directory in the console: cd C:\node-ocr
4. Run the script from the console: node converter-miscbahai.js Images Revelation_of_Bahaullah_Vol_1 373
   1. Images is the folder on your desktop.
   2. Revelation_of_Bahaullah_Vol_1 is the name of the PDF.
   3. 373 is the total number of pages in the PDF file (the total number of images created in step 1).
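
The three arguments can be read from process.argv in Node.js. The following is a minimal sketch of that step, not the repository's actual code: the function name parseArgs and the returned field names are hypothetical, and the converter scripts may handle their arguments differently.

```javascript
// Hypothetical helper, not taken from the converter scripts.
// argv is expected to have the shape of process.argv:
// [nodePath, scriptPath, folder, name, pageCount]
function parseArgs(argv) {
  const [folder, name, pageCountText] = argv.slice(2);
  const pageCount = Number.parseInt(pageCountText, 10);
  if (!folder || !name || !Number.isInteger(pageCount) || pageCount < 1) {
    throw new Error('Usage: node converter-miscbahai.js <folder> <name> <pageCount>');
  }
  return { folder, name, pageCount };
}

// Example using the command shown above.
const args = parseArgs([
  'node', 'converter-miscbahai.js',
  'Images', 'Revelation_of_Bahaullah_Vol_1', '373',
]);
console.log(args);
```

In a real script the call would be parseArgs(process.argv). Validating the page count up front gives a readable usage message instead of a file-not-found error partway through a run.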

If you get the error "ReferenceError: __dirname is not defined in ES module scope", add the following to the top of your converter-miscbahai.js script:

import path from 'path';
import { fileURLToPath } from 'url';
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

If you get the error "tesseract is not recognized as an internal or external command", you need to add the Tesseract folder to your PATH variable (Windows). Open the Start menu, search for "path", and choose "Edit the system environment variables". Click "Environment Variables" at the bottom of the "Advanced" tab. In the top box, select "Path" and click "Edit". Click "New" and add the Tesseract installation folder, for example C:\Program Files\Tesseract-OCR.

Close the console, open a new one, and retry.

If you get an error about a file not being found, double-check the path, the filename, the start variable in the .js script (e.g. start: 1 to start at Somefilename_Page_1.png), and the ot.pad(...) call (described next).
The ot.pad(i, 3) call in the .js script sets the number of digits the page number is zero-padded to in the filename. Somefilename_Page_1.png would use ot.pad(i, 1), Somefilename_Page_01.png would use ot.pad(i, 2), Somefilename_Page_001.png would use ot.pad(i, 3), and so on.
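
The padding rule can be sketched as follows. The helper names pad and pageFileName are hypothetical stand-ins, assumed to behave like ot.pad as described above; they are not taken from the repository.

```javascript
// Hypothetical stand-in for ot.pad: left-pads a page number with zeros
// until it is `width` characters long.
function pad(pageNumber, width) {
  return String(pageNumber).padStart(width, '0');
}

// Hypothetical helper: builds an image file name in the pattern used above.
function pageFileName(baseName, pageNumber, width) {
  return `${baseName}_Page_${pad(pageNumber, width)}.png`;
}

console.log(pageFileName('Somefilename', 1, 1));  // Somefilename_Page_1.png
console.log(pageFileName('Somefilename', 1, 2));  // Somefilename_Page_01.png
console.log(pageFileName('Somefilename', 1, 3));  // Somefilename_Page_001.png
console.log(pageFileName('Somefilename', 42, 3)); // Somefilename_Page_042.png
```

Page numbers that already have at least as many digits as the width are left unchanged, so the width only needs to match how the image files were actually named when they were exported.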

Tesseract in other languages

To run OCR in a language other than English, first download the language data (https://github.com/tesseract-ocr/tessdata). On my machine these files go in C:\Program Files\Tesseract-OCR\tessdata. Then modify line 25 of node_modules/node-tesseract/lib/tesseract.js.

If you downloaded German (deu.traineddata), change 'l': 'eng' to read 'l': 'deu'.
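
As an illustration of that change, the sketch below models the defaults as a plain object and overrides only the language. The object shape and the withLanguage helper are assumptions made for this example; only the 'l' key and its 'eng' and 'deu' values come from the text above.

```javascript
// Assumed shape: only the 'l' entry is described in this README.
const defaultOptions = { l: 'eng' };

// Hypothetical helper: returns a copy of the options with another language code.
function withLanguage(options, languageCode) {
  return { ...options, l: languageCode };
}

const germanOptions = withLanguage(defaultOptions, 'deu');
console.log(germanOptions.l);  // deu
console.log(defaultOptions.l); // eng (the original object is not modified)
```

The language code should match the name of the downloaded data file, so deu.traineddata corresponds to 'deu'.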
