This repository contains a simple containerized API to convert PDF documents to text using Mozilla's pdf.js and pdf.js-extract.
The image is available on Docker Hub under the name codeinchq/pdf2txt.
By default, the container listens on port 3000. The port is configurable using the PORT environment variable.
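As a rough sketch of overriding the port, the helper below starts the container on a port of your choice. The function name run_pdf2txt and the default value 8080 are illustrative assumptions; only the PORT variable and the image name come from this README.

```shell
# Hypothetical helper: start the container on a custom port.
# PORT tells the server which port to listen on inside the container;
# -p publishes that same port on the host.
run_pdf2txt() {
  port="${1:-8080}"   # 8080 is an arbitrary example value
  docker run -e "PORT=${port}" -p "${port}:${port}" codeinchq/pdf2txt
}
```

Calling run_pdf2txt 8080 should make the API reachable at http://localhost:8080, assuming that port is free on the host.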
All requests must be sent as POST requests to the /extract endpoint with a multipart/form-data content type. The request must contain a PDF file under the key file.
Additional parameters can be sent to customize the conversion process:
firstPage: The first page to extract. Default is 1.
lastPage: The last page to extract. Default is the last page of the document.
password: The password to unlock the PDF. Default is none.
normalizeWhitespace: If set to true, the server normalizes the whitespace in the extracted text. Default is true.
format: The output format. Supported values are text (the server returns the raw text as text/plain) or json (the server returns a JSON object as text/json). Default is text.
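The parameters above can be combined in a single request. The following is a minimal sketch, assuming the service runs on the default port; the function name extract_pages and its argument order are hypothetical, while the form field names are the ones listed above.

```shell
# Hypothetical helper: extract a page range as JSON.
# Usage: extract_pages <pdf> <firstPage> <lastPage> [base_url]
extract_pages() {
  pdf="$1"; first="$2"; last="$3"
  base_url="${4:-http://localhost:3000}"   # assumed default address
  # Each -F adds one multipart/form-data field; "@" uploads a file.
  curl -X POST \
    -F "file=@${pdf}" \
    -F "firstPage=${first}" \
    -F "lastPage=${last}" \
    -F "normalizeWhitespace=true" \
    -F "format=json" \
    "${base_url}/extract"
}
```

For example, extract_pages /path/to/file.pdf 2 5 would request pages 2 through 5 and print the JSON response to standard output.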
The server returns a 200 status code if the conversion was successful, with the extracted text in the response body. In case of error, the server returns a 400 status code with a JSON object containing the error message (format: {error: string}).
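A script calling the API will usually want to tell success from failure by status code. The sketch below is one way to do that with curl; the function name extract_or_fail and the wording of the error message are assumptions, not part of the service.

```shell
# Hypothetical helper: save the extracted text to a file, or report the error.
# Usage: extract_or_fail <pdf> <output_file> [base_url]
extract_or_fail() {
  pdf="$1"; out="$2"
  base_url="${3:-http://localhost:3000}"   # assumed default address
  # -s silences progress output, -o writes the response body to a file,
  # -w prints the HTTP status code on stdout after the transfer.
  status="$(curl -s -o "$out" -w '%{http_code}' -X POST \
    -F "file=@${pdf}" "${base_url}/extract")"
  if [ "$status" = "200" ]; then
    return 0
  fi
  # On failure the body is expected to be the JSON error object; print it as-is.
  echo "extraction failed (HTTP ${status}): $(cat "$out" 2>/dev/null)" >&2
  return 1
}
```

On a 400 response the output file holds the JSON error object rather than extracted text, so callers should check the return code before using the file.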
docker run -p "3000:3000" codeinchq/pdf2txt
Convert a PDF file to text with a JSON response:
curl -X POST -F "file=@/path/to/file.pdf" -F "format=json" http://localhost:3000/extract -o example.json
Convert a PDF file to text:
curl -X POST -F "file=@/path/to/file.pdf" http://localhost:3000/extract
Extract a password-protected PDF file's text content as JSON and save it to a file:
curl -X POST -F "file=@/path/to/file.pdf" -F "password=XXX" -F "format=json" http://localhost:3000/extract -o example.json
A health check is available at the /health endpoint. The server returns a status code of 200 if the service is healthy, along with a JSON object:
{ "status": "up" }
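The health endpoint is handy for waiting until the container is ready before sending PDFs. The following is a minimal polling sketch; the function name wait_for_pdf2txt, the one-second interval, and the default of 30 attempts are all assumptions.

```shell
# Hypothetical helper: poll /health until the service answers successfully.
# Usage: wait_for_pdf2txt [base_url] [max_attempts]
wait_for_pdf2txt() {
  base_url="${1:-http://localhost:3000}"   # assumed default address
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (status >= 400),
    # and connection failures are non-zero as well.
    if curl -sf -o /dev/null "${base_url}/health"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

A startup script could then run wait_for_pdf2txt and only proceed to the /extract calls when it returns 0.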
A PHP 8 client is available on GitHub and Packagist.
This project is licensed under the MIT License - see the LICENSE file for details.