Exploratory code for PDF image mining. A multi-page PDF is split and converted to JPEG files, which are then mined for illustrations and images. Based on https://github.com/megloff1/image-mining with added PDF splitting, a simple GUI and queue management.
- Make sure you have Git and Docker with docker-compose installed.
- Get the latest version of this repository:
    git clone --depth 1 https://github.com/peterk/pimmer.git
- Copy the example_env file to .env and edit the settings.
- Make sure you have a folder called data in the project root folder (jobs and resulting image files will end up here). You can map output to a different local folder for the worker in docker-compose.yml.
- Run
    docker-compose up -d
  and wait a minute until the queue and worker are up.
The service is now running on http://localhost:7777.
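Instead of waiting a fixed minute, a script can poll the service until it answers. The following is a minimal sketch, assuming the root page at http://localhost:7777 returns a successful HTTP status once the service is ready. The function name `wait_for_service` and the `CURL` and `WAIT_SECONDS` variables are hypothetical helpers for illustration, not part of this project.

```shell
# Poll a URL until it responds successfully or the attempts run out.
# Assumption: the service root page returns a 2xx status when it is ready.
wait_for_service() {
    url="${1:-http://localhost:7777}"
    attempts="${2:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl exit non-zero on HTTP errors (4xx/5xx).
        # CURL is a hypothetical override so another binary can be substituted.
        if ${CURL:-curl} --silent --fail --output /dev/null "$url"; then
            echo "Service is up at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "${WAIT_SECONDS:-2}"
    done
    echo "Service did not respond at $url" >&2
    return 1
}
```

Calling `wait_for_service` with no arguments tries the default URL up to 30 times, two seconds apart.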
If you are planning on processing a large number of documents, you can start more workers with
    docker-compose up -d --scale worker=5
and then post files with curl to the /process/ endpoint:
    curl -v --silent -F "file=@testdata/hat_catalog.pdf" http://0.0.0.0:7777/process/
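For a whole folder of documents, the curl call can be wrapped in a loop. This is a rough sketch, assuming the service is reachable at http://localhost:7777/process/ and accepts one file per request as in the command above. The function name `post_pdfs` and the `PIMMER_URL` and `CURL` variables are hypothetical names introduced here for illustration.

```shell
# Post every PDF in a folder to the /process/ endpoint, one request per file.
# PIMMER_URL is a hypothetical variable; adjust it to where the service runs.
PIMMER_URL="${PIMMER_URL:-http://localhost:7777/process/}"

post_pdfs() {
    dir="$1"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the folder contains no PDFs.
        [ -e "$pdf" ] || continue
        echo "Posting $pdf"
        # Same form field name ("file") as the single-file curl command.
        # CURL is a hypothetical override, e.g. CURL=echo for a dry run.
        ${CURL:-curl} --silent --show-error -F "file=@$pdf" "$PIMMER_URL"
    done
}
```

For example, `post_pdfs testdata` would submit each PDF in the testdata folder, and `CURL=echo post_pdfs testdata` only prints the arguments that would be passed to curl.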
Please report bugs and feedback in the GitHub issue tracker.
The detected images end up as individual image files in job folders under ./data/results.
Each job folder also contains one JSON file per page with the coordinates of the detected images.
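To get a quick overview of finished jobs, the results folder can be summarized from the shell. The sketch below assumes each job is a direct subfolder of ./data/results and that the per-page files carry a .json extension; the exact folder layout and the JSON schema are not described here, so the script only counts files and does not parse them. The function name `list_jobs` is hypothetical.

```shell
# Print each job folder with the number of JSON files found inside it.
# Assumption: job folders are direct subfolders of the results directory.
list_jobs() {
    results_dir="${1:-./data/results}"
    for job in "$results_dir"/*/; do
        # Skip the unexpanded pattern when there are no job folders yet.
        [ -d "$job" ] || continue
        # tr strips the padding that some wc implementations add.
        count=$(find "$job" -name '*.json' | wc -l | tr -d ' ')
        echo "$(basename "$job"): $count JSON file(s)"
    done
}
```

Running `list_jobs` with no argument inspects ./data/results; pass another path if the worker output is mapped elsewhere.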