Enhancement docker build & words and bounds results #49

Open: wants to merge 7 commits into base: master
11 changes: 11 additions & 0 deletions .dockerignore
@@ -0,0 +1,11 @@
*.swo
*.swp
.git
.DS_Store
node_modules
bin
build
scripts
coverage
.nyc_output
.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ node_modules/
.DS_Store
yarn.lock
*.tar.gz
build
16 changes: 16 additions & 0 deletions Dockerfile
@@ -0,0 +1,16 @@
FROM lambci/lambda:build-nodejs8.10

RUN yum install -y autoconf automake libtool libjpeg-devel \
    libpng-devel libtiff-devel zlib-devel wget gzip make cmake gcc freetype-devel \
    gcc-c++ git lcms2-devel libjpeg-turbo-devel autogen \
    libwebp-devel libzip-devel libgcc

RUN yum groupinstall "Development Tools" -y

RUN yum install -y cmake

COPY . .

RUN wget https://github.com/google/brotli/archive/v1.0.7.tar.gz
RUN tar -zxvf v1.0.7.tar.gz
RUN cd brotli-1.0.7 && mkdir out && cd out && ../configure-cmake && make && make test && make install
18 changes: 18 additions & 0 deletions Makefile
@@ -0,0 +1,18 @@
NAME=aws-lambda-tesseract
PWD=$(shell pwd)

build: build-docker build-tesseract compress-tesseract

build-dist: build-tesseract compress-tesseract

build-docker:
	docker build -f Dockerfile -t $(NAME) .

version:
	@echo $(shell git rev-parse HEAD)

build-tesseract:
	docker run -it -v $(PWD)/scripts:/scripts -v $(PWD)/build:/build $(NAME) /scripts/compile-tesseract.sh

compress-tesseract:
	docker run -it -v $(PWD)/scripts:/scripts -v $(PWD)/build:/build $(NAME) /scripts/compress-brotli.sh
66 changes: 0 additions & 66 deletions compile-tesseract.sh

This file was deleted.

4 changes: 0 additions & 4 deletions compress-with-brotli.sh

This file was deleted.

1 change: 1 addition & 0 deletions package.json
@@ -29,6 +29,7 @@
   ],
   "dependencies": {
     "@shelf/aws-lambda-brotli-unpacker": "0.0.2",
+    "csv-parse": "^4.3.0",
     "is-image": "2.0.0"
   },
   "devDependencies": {
47 changes: 43 additions & 4 deletions readme.md
@@ -18,18 +18,49 @@ When a Lambda starts, it unpacks an archive with a binary to the `/tmp` folder a

## Usage

#### Using a path

```js
const {getTextFromImage, isSupportedFile} = require('@shelf/aws-lambda-tesseract');

 module.exports.handler = async event => {
   // assuming there is a photo.jpg inside /tmp dir
-  // original file will be deleted afterwards

   if (!isSupportedFile('/tmp/photo.jpg')) {
     return false;
   }

-  return getTextFromImage('/tmp/photo.jpg');
+  getTextFromImage('/tmp/photo.jpg').then(result => console.log(result));
 };
```

#### Using a stream

This is useful for when you want to stream the file data from a remote source like a URL.

```js
const https = require('https');
const {getTextFromImage, isSupportedFile} = require('@shelf/aws-lambda-tesseract');

module.exports.handler = async event => {
  // assuming that the url exists and is readable.
  const url = 'https://cdn-std.dprcdn.net/files/acc_55602/9X4IIL';
  // reject on request errors so a failed download does not leave the promise pending
  const fileStream = await new Promise((resolve, reject) =>
    https.get(url, resolve).on('error', reject)
  );
  getTextFromImage(fileStream).then(result => console.log(result));
};
```
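
How the path and stream variants end up on Tesseract's command line can be sketched as follows. This is a simplified illustration rather than the package's code: `resolveInput` and `buildArgs` are hypothetical helper names, and the existence check is injected so the sketch runs without touching the file system. The only Tesseract behaviour assumed is that the literal input name `stdin` makes it read the image from standard input.

```javascript
// Hypothetical helpers: decide what to pass to tesseract as the input argument.
function resolveInput(file, exists) {
  if (!file) return false; // no input at all, e.g. for `--version`
  if (typeof file === 'string' && exists(file)) return file; // existing path on disk
  return 'stdin'; // anything else is treated as a readable stream
}

// Prepend the input argument (if any) to the remaining tesseract options.
function buildArgs(file, opts, exists) {
  const input = resolveInput(file, exists);
  return input ? [input, ...opts] : opts;
}

// Stand-in for fs.existsSync so the example has no side effects.
const fakeExists = p => p === '/tmp/photo.jpg';

console.log(buildArgs('/tmp/photo.jpg', ['stdout', '-l', 'eng'], fakeExists));
console.log(buildArgs({pipe() {}}, ['stdout', '-l', 'eng'], fakeExists));
console.log(buildArgs(false, ['--version'], fakeExists));
```

In the stream case the caller still has to pipe the stream into the child process's stdin; the sketch only covers the argument list.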

#### Extracting words and their coordinates

The `getWordsAndBounds` function resolves to an array of objects, one per row of Tesseract's TSV output, containing the extracted words and their coordinates on the page.

```js
const fs = require('fs');
const {getWordsAndBounds} = require('@shelf/aws-lambda-tesseract');

module.exports.handler = async event => {
  // assuming that photo.jpg exists and is readable.
  const file = fs.createReadStream(__dirname + '/photo.jpg');

  getWordsAndBounds(file).then(result => console.log(result));
};
```
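
Tesseract's TSV output has one row per layout element, with the columns `level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf` and `text`; word rows have `level` 5. Assuming `getWordsAndBounds` resolves to an array of such rows with string values (which is what `csv-parse` produces with `columns: true`), a minimal sketch for keeping only confident words could look like this. `extractWords` and the `minConf` threshold are hypothetical, and the sample rows are made up for illustration.

```javascript
// Hypothetical post-processing of rows shaped like Tesseract's TSV output.
function extractWords(rows, minConf = 0) {
  return rows
    .filter(row => Number(row.level) === 5) // level 5 = word
    .filter(row => row.text.trim() !== '' && Number(row.conf) >= minConf)
    .map(row => ({
      text: row.text,
      conf: Number(row.conf),
      left: Number(row.left),
      top: Number(row.top),
      right: Number(row.left) + Number(row.width),
      bottom: Number(row.top) + Number(row.height)
    }));
}

// Made-up sample rows: one page-level row and two word-level rows.
const sampleRows = [
  {level: '1', left: '0', top: '0', width: '640', height: '480', conf: '-1', text: ''},
  {level: '5', left: '10', top: '20', width: '30', height: '12', conf: '96', text: 'Hello'},
  {level: '5', left: '50', top: '20', width: '40', height: '12', conf: '35', text: 'wor1d'}
];

console.log(extractWords(sampleRows, 60));
```

Converting `width` and `height` into `right` and `bottom` is only a convenience for drawing or hit-testing; the raw rows already carry everything needed.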

@@ -38,7 +69,15 @@ unsupported by Tesseract file extensions.

 ## Compile It Yourself

-See [compile-tesseract.sh](compile-tesseract.sh) & [compress-with-brotli.sh](compress-with-brotli.sh) files
+Compile Tesseract for deployment on Lambda. Requires [Docker](https://www.docker.com/) and [Make](https://www.gnu.org/software/make/manual/html_node/Introduction.html) to be installed.
+
+`$ make build`: Builds the Docker image, compiles Tesseract 4.0.0, and compresses the result into the `tt.tar.br` archive.
+
+`$ make build-tesseract`: Compiles Tesseract 4.0.0 and writes `build/tesseract.tar.gz` as output.
+
+`$ make compress-tesseract`: Repacks the built `tesseract.tar.gz` and compresses it with Brotli into `tt.tar.br`.
+
+**Note:** After compiling and compressing, copy the latest `tt.tar.br` into the `bin` directory: `$ cp ./build/tt.tar.br ./bin`

## See Also

61 changes: 61 additions & 0 deletions scripts/compile-tesseract.sh
@@ -0,0 +1,61 @@
#!/usr/bin/env bash

echo "Building"
# Build leptonica
wget http://www.leptonica.com/source/leptonica-1.77.0.tar.gz
tar -zxvf leptonica-1.77.0.tar.gz
ls -la ./
cd leptonica-1.77.0
ls -la
./configure
make
make install

# Build tesseract 4.0
cd ..
wget https://github.com/tesseract-ocr/tesseract/archive/4.0.0.tar.gz
tar -zxvf 4.0.0.tar.gz
cd tesseract-4.0.0/
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
make install
ldconfig

cd ~
mkdir tesseract-standalone

# copy files
cd tesseract-standalone
cp /usr/local/bin/tesseract .
mkdir lib
cp /usr/local/lib/libtesseract.so.4 lib/
cp /usr/local/lib/liblept.so.5 lib/
# cp /usr/lib64/* lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/
cp /usr/lib64/libpng15.so.15 lib/
cp /usr/lib64/libtiff.so.5 lib/
cp /usr/lib64/libgomp.so.1 lib/
cp /usr/lib64/libjbig.so.2.0 lib/

# trim unneeded ~ 15 MB (runs after the copy; stripping the empty directory would do nothing)
strip tesseract lib/*

# copy training data
mkdir tessdata
cd tessdata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata

# Create configs
mkdir configs
echo "tessedit_create_tsv 1" > configs/tsv

# archive
cd ~
tar -zcvf tesseract.tar.gz tesseract-standalone
mv tesseract.tar.gz /build/

echo "Done!"
10 changes: 10 additions & 0 deletions scripts/compress-brotli.sh
@@ -0,0 +1,10 @@
#!/usr/bin/env bash

tar -C /build -zxvf /build/tesseract.tar.gz
mv /build/tesseract-standalone /build/tesseract
cd /build
tar -cf tt.tar tesseract
rm -rf tesseract
echo "Running brotli (this can take a few minutes)"
brotli --best --force --verbose /build/tt.tar
echo "Done"
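
For readers without the `brotli` CLI, the compression step can be approximated in Node itself. This is an alternative sketch, not what the script above runs: it assumes a Node version that ships Brotli in `zlib` (10.16+ or 11.7+), which is newer than the `nodejs8.10` build image used here, and it compresses an in-memory buffer instead of `/build/tt.tar`. `brotliBest` is a hypothetical helper name.

```javascript
const zlib = require('zlib');

// Compress a buffer at maximum quality, the library counterpart of `brotli --best`.
function brotliBest(buffer) {
  return zlib.brotliCompressSync(buffer, {
    params: {
      [zlib.constants.BROTLI_PARAM_QUALITY]: zlib.constants.BROTLI_MAX_QUALITY,
      [zlib.constants.BROTLI_PARAM_SIZE_HINT]: buffer.length
    }
  });
}

// Stand-in for the tar archive: highly repetitive data compresses well.
const input = Buffer.from('tesseract '.repeat(1000));
const packed = brotliBest(input);
const unpacked = zlib.brotliDecompressSync(packed);

console.log(`${input.length} bytes -> ${packed.length} bytes`);
console.log('round trip ok:', unpacked.equals(input));
```

Maximum quality is slow on large inputs, which matches the script's warning that the step can take a few minutes.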
78 changes: 55 additions & 23 deletions src/index.js
@@ -1,30 +1,71 @@
-const {unpack} = require('@shelf/aws-lambda-brotli-unpacker');
-const {execFileSync, execSync} = require('child_process');
+const {execFile} = require('child_process');
 const path = require('path');
+const fs = require('fs');
+const parseCSV = require('csv-parse');
 const isImage = require('is-image');
+const {unpack} = require('@shelf/aws-lambda-brotli-unpacker');

const unsupportedExtensions = new Set(['ai', 'emf', 'eps', 'gif', 'ico', 'psd', 'svg']);
const inputPath = path.join(__dirname, '..', 'bin', 'tt.tar.br');
const outputPath = '/tmp/tesseract/tesseract';

async function runTesseract(file, opts) {
  const ttBinary = process.env.TESSERACT_BINARY_PATH || (await unpack({inputPath, outputPath}));

  // `file` may be a path on disk, a readable stream (piped to stdin), or falsy (no input file)
  let processFile = 'stdin';
  if (typeof file === 'string' && fs.existsSync(file)) processFile = file;
  if (!file) processFile = false;

  const options = {
    env: {}
  };
  if (!process.env.TESSERACT_BINARY_PATH) {
    // a template literal is always truthy, so test the variable itself before prepending it
    options.env.LD_LIBRARY_PATH = process.env.LD_LIBRARY_PATH
      ? `${process.env.LD_LIBRARY_PATH}:/tmp/tesseract/lib`
      : '/tmp/tesseract/lib';
    options.env.TESSDATA_PREFIX = process.env.TESSDATA_PREFIX || '/tmp/tesseract/tessdata';
    options.cwd = '/tmp/tesseract';
  }

  return new Promise((resolve, reject) => {
    const finalOpts = processFile ? [processFile, ...opts] : opts;
    const child = execFile(ttBinary, finalOpts, options, (error, stdout) => {
      if (error) return reject(error);
      return resolve(stdout);
    });
    if (processFile === 'stdin') file.pipe(child.stdin);
  });
}

function isUnsupportedFileExtension(filePath) {
  const ext = path
    .extname(filePath)
    .slice(1)
    .toLowerCase();

  return unsupportedExtensions.has(ext);
}

module.exports.getExecutablePath = async function() {
  return unpack({inputPath, outputPath});
};

-module.exports.getTextFromImage = async function(filePath) {
-  const ttBinary = await unpack({inputPath, outputPath});
-
-  const stdout = execFileSync(ttBinary, [filePath, 'stdout', '-l', 'eng'], {
-    cwd: '/tmp/tesseract',
-    env: {
-      LD_LIBRARY_PATH: './lib',
-      TESSDATA_PREFIX: './tessdata'
-    }
-  });
+module.exports.getTextFromImage = async function(file) {
+  const result = await runTesseract(file, ['stdout', '-l', 'eng']);
+  return result.toString();
+};

-  execSync(`rm ${filePath}`);
+module.exports.getWordsAndBounds = async function(file) {
+  const result = await runTesseract(file, ['stdout', '-l', 'eng', 'tsv']);
+  const object = await new Promise((resolve, reject) =>
+    parseCSV(result.toString(), {delimiter: '\t', columns: true}, (err, result) => {
+      if (err) return reject(err);
+      return resolve(result);
+    })
+  );
+  return object;
+};

-  return stdout.toString();
+module.exports.version = async function() {
+  const result = await runTesseract(false, ['--version']);
+  return result.toString();
 };

module.exports.isSupportedFile = function(filePath) {
@@ -35,12 +76,3 @@ module.exports.isSupportedFile = function(filePath) {

   return !isUnsupportedFileExtension(filePath);
 };
-
-function isUnsupportedFileExtension(filePath) {
-  const ext = path
-    .extname(filePath)
-    .slice(1)
-    .toLowerCase();
-
-  return unsupportedExtensions.has(ext);
-}
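
One pitfall worth illustrating when extending a colon-separated variable such as `LD_LIBRARY_PATH`: a template literal always evaluates to a non-empty string, so an `||` fallback placed after it never fires, and an unset variable is interpolated as the text `undefined`. The sketch below shows the behaviour and one possible way around it; `appendSearchPath` is a hypothetical helper, not part of this package.

```javascript
// Hypothetical helper: append a directory to a colon-separated search path.
function appendSearchPath(current, extraDir) {
  return current ? `${current}:${extraDir}` : extraDir;
}

// The `||` fallback is never used because the left-hand string is always truthy.
const naive = `${undefined}:/tmp/tesseract/lib` || '/tmp/tesseract/lib';
console.log(naive);

console.log(appendSearchPath(undefined, '/tmp/tesseract/lib'));
console.log(appendSearchPath('/var/task/lib', '/tmp/tesseract/lib'));
```

A stray `undefined` entry in the search path is usually harmless to the dynamic linker, but checking the variable first keeps the environment passed to the child process clean.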